What Changed Between 3.1 and 3.3
Meta released Llama 3.3 70B in November 2024 with two primary improvements over the earlier 3.1 70B:
- Better instruction following data — the SFT (supervised fine-tuning) dataset was heavily curated, removing noisy examples and adding more complex multi-turn conversations
- RLHF updates — improved reward model and DPO (direct preference optimization) passes resulted in stronger alignment to human preferences across writing, coding, and reasoning tasks
The architecture and tokenizer are unchanged from 3.1, which means the same inference infrastructure works without modification.
Benchmark Comparison: 3.3 70B vs 3.1 405B
This is the headline result that surprised many teams when Llama 3.3 dropped:
| Benchmark | Llama 3.1 405B | Llama 3.3 70B | Qwen 2.5 72B | |---|---|---|---| | MMLU | 88.6% | 86.0% | 86.1% | | MATH | 73.8% | 77.0% | 83.1% | | IFEval | 88.6% | 92.1% | 87.0% | | GPQA | 51.1% | 50.7% | 49.0% |
On IFEval (instruction following), 3.3 70B actually outperforms the 405B — the training data improvements are most visible here. On MATH, 3.3 70B matches or beats 405B. The gap only reopens on the most complex reasoning benchmarks where raw parameter count still matters.
Hardware Requirements
- Single A100-80GB: runs 3.3 70B at BF16 with a few GB to spare. This is the minimum comfortable setup for production serving.
- Two A6000-48GB (96GB total): viable with tensor parallelism via vLLM
- M2 Ultra Mac Studio (192GB): runs at roughly 20 tokens/second via llama.cpp or Ollama
- A10G-24GB: too small for BF16; use Q6_K quantization via llama.cpp
ollama pull llama3.3:70b
ollama run llama3.3:70b
In Q4_K_M quantization via Ollama, a 64GB MacBook Pro M2 Max can serve the 70B model at approximately 14 tokens/second — enough for interactive use.
Context Window: 128k
The 128k token context window is one of Llama 3.3 70B's most practical advantages over smaller open-source models. At 128k, you can fit:
- An entire small codebase (20–30 files)
- A full book manuscript for editing
- A month of email thread for summarization
- A large PDF report with all figures described in text
Most tasks that require an entire repository or large document as context now work well with the 70B model rather than requiring the 405B.
Llama 3.3 vs Qwen 2.5 72B
The two models are extremely close in benchmark scores. Practical differences:
- Language: Qwen 2.5 72B is stronger in Chinese and several Asian languages
- Math: Qwen 2.5 72B scores higher on MATH (83.1% vs 77.0%)
- Instruction following: Llama 3.3 edges ahead on English IFEval
- License: both are permissively licensed for commercial use, though Llama's license has user threshold restrictions at very high scale
For most English-language deployments, either model is an excellent choice. The deciding factor is often which ecosystem you are already using (Ollama, vLLM, llama.cpp) and which fine-tunes are available for your use case.