Architecture Innovations Over Gemma 1
Gemma 2 introduces two architectural departures that distinguish it from most open-source models:
1. Alternating Local/Global Attention (Sliding Window) Rather than applying full self-attention to every layer, Gemma 2 alternates between local sliding window attention (attending to nearby tokens only) and global attention. This reduces quadratic complexity for long sequences while preserving the global context that matters for reasoning.
2. Logit Soft-Capping
Raw attention logits are passed through a tanh soft-cap before the softmax, preventing extreme attention weights that can destabilize training. This technique improved training stability and enabled longer training runs without divergence.
Knowledge Distillation From Gemini
Gemma 2 27B was trained using knowledge distillation from Google's Gemini models — not just standard next-token prediction on text. The model learns to match Gemini's output distributions, transferring capability from a much larger teacher into a 27B student.
Result: MMLU 75.2% vs Llama 3 70B at 73.1%. A model less than half the size, beating the leading open-source 70B model on a flagship knowledge benchmark.
Framework Support
Unlike some models locked to a single framework, Gemma 2 officially supports:
- Keras (with JAX/TensorFlow/PyTorch backends)
- JAX directly
- PyTorch via HuggingFace transformers
- Ollama for local inference
# Install via Ollama
ollama pull gemma2:27b
ollama run gemma2:27b "Explain the vanishing gradient problem."
# Smaller variants
ollama pull gemma2:9b
ollama pull gemma2:2b
HuggingFace Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-2-27b-it",
device_map="auto",
torch_dtype=torch.bfloat16
)
input_ids = tokenizer(
"Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes.",
return_tensors="pt"
).to("cuda")
output = model.generate(**input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Kaggle Models Hub
Google hosts Gemma 2 on Kaggle Models with one-click fine-tuning notebooks — useful for teams that want to adapt the model to domain-specific tasks without managing infrastructure.
Benchmark Summary
| Model | MMLU | MT-Bench | HumanEval | Params | |-------|------|----------|-----------|--------| | Gemma 2 27B | 75.2% | 7.9 | 72.0% | 27B | | Llama 3 70B | 73.1% | 8.1 | 81.7% | 70B | | Gemma 2 9B | 71.3% | 7.3 | 54.9% | 9B |
Gemma 2 27B wins on knowledge (MMLU) but trails on code (HumanEval). The 9B is a strong choice when 27B is too large.
Summary
Gemma 2 27B demonstrates that architectural innovation and knowledge distillation can overcome raw parameter count. It's Google's most capable openly released model and the best option in the 10-30B parameter class. Get weights at HuggingFace or experiment via Kaggle.