What Quantization Does
A Llama 3.1 70B model stored in float32 needs about 280 GB for the weights alone (70 billion parameters × 4 bytes), which is impractical on consumer hardware. Quantization reduces the precision of model weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers. The result: weights that take roughly half (8-bit) to under a third (4-bit) of the 16-bit footprint, with a small, usually acceptable quality loss.
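The arithmetic behind these figures is parameters × bits per weight ÷ 8. A minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (no KV cache, activations, or per-block scales); the function name is illustrative:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Idealized footprint of a 70B model at several precisions.
for name, bits in [("float32", 32), ("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"70B at {name}: {weight_memory_gb(70, bits):.0f} GB")
```

Real quantized files come out somewhat larger than the idealized 8-bit and 4-bit numbers, because each block of weights also stores scale factors and some tensors are kept at higher precision.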
GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp and by tools built on it, such as Ollama and LM Studio. A single GGUF file stores the quantized weights together with metadata describing the model and its quantization scheme.
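To make the "weights plus metadata" layout concrete, here is a hedged sketch that reads only the fixed fields at the start of a GGUF file. It assumes the little-endian layout of GGUF version 2 and later (4-byte magic `GGUF`, a uint32 version, then uint64 tensor and metadata counts); the function name is illustrative and the sketch does not parse the metadata entries themselves.

```python
import struct


def read_gguf_header(path: str) -> dict:
    """Read the magic, version, and counts at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))
        # Assumption: version 2+ layout, where both counts are uint64.
        tensor_count, kv_count = struct.unpack("<QQ", f.read(16))
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

Calling `read_gguf_header("model.gguf")` on a downloaded file (the path is a placeholder) is a quick sanity check that the download is a GGUF file and not an HTML error page or a different format.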
Quantization Levels Explained
| Level | Bits | Method | Quality | Notes |
|---|---|---|---|---|
| Q4_0 | 4 | Simple block quant | Moderate | Fastest, lowest quality |
| Q4_K_M | 4 | K-quant, medium | Good | Best 4-bit for most use cases |
| Q4_K_S | 4 | K-quant, small | Moderate | Smaller than Q4_K_M |
| Q5_K_M | 5 | K-quant, medium | Very good | ~20% more RAM than Q4_K_M |
| Q6_K | 6 | K-quant | Near-lossless | Excellent for 13B and smaller |
| Q8_0 | 8 | 8-bit block quant | Near-perfect | 2x size of Q4, minimal quality loss |
| F16 | 16 | Half precision | Reference | ~2x Q8_0, baseline quality |
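K-quant methods (K_S, K_M, K_L) improve on the naive Q4_0 scheme. Q4_0 stores a single scale for each block of 32 weights; K-quants group blocks into larger super-blocks with more finely quantized scales, and the S/M/L mixes keep the tensors that matter most (typically some attention and feed-forward tensors) at higher precision. Q4_K_M is the community default for 4-bit inference.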
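The basic idea of block quantization can be sketched in a few lines. This is a simplified illustration in the spirit of Q4_0 (blocks of 32 weights sharing one scale), not llama.cpp's exact rounding or bit layout, and it does not implement K-quants; the function names and the sample weight distribution are assumptions made for the example.

```python
import random

BLOCK_SIZE = 32  # weights per block


def quantize_block(block):
    """Map floats to signed 4-bit integers in [-8, 7] plus one shared scale."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale, q):
    """Recover approximate floats from the integers and the block scale."""
    return [scale * v for v in q]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}, max abs error={max_err:.5f}")

# Storage per block: 32 * 4 bits of weights + one 16-bit scale = 144 bits,
# i.e. 4.5 bits per weight rather than a flat 4.
```

The rounding error of each weight is bounded by half the block scale, so a single outlier in a block inflates the error for its 31 neighbours. That is the weakness the more elaborate schemes try to reduce.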
Memory Requirements by Model Size and Quant
| Model | Q4_K_M | Q5_K_M | Q8_0 | F16 |
|---|---|---|---|---|
| 7B | 4.1 GB | 5.0 GB | 7.7 GB | 14 GB |
| 13B | 7.4 GB | 9.0 GB | 14 GB | 26 GB |
| 34B | 19 GB | 23 GB | 36 GB | 68 GB |
| 70B | 40 GB | 48 GB | 75 GB | 140 GB |
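Dividing the sizes in the table by the parameter count gives the effective bits per weight, which shows why a "4-bit" file is larger than a flat 4 bits per parameter. A small sketch using only the table's own numbers; the nominal parameter counts (7, 13, 34, 70 billion) are an approximation.

```python
# File sizes in GB, copied from the table above.
SIZES_GB = {
    "7B": {"Q4_K_M": 4.1, "Q5_K_M": 5.0, "Q8_0": 7.7, "F16": 14},
    "13B": {"Q4_K_M": 7.4, "Q5_K_M": 9.0, "Q8_0": 14, "F16": 26},
    "34B": {"Q4_K_M": 19, "Q5_K_M": 23, "Q8_0": 36, "F16": 68},
    "70B": {"Q4_K_M": 40, "Q5_K_M": 48, "Q8_0": 75, "F16": 140},
}
# Nominal parameter counts in billions (an approximation).
PARAMS_B = {"7B": 7, "13B": 13, "34B": 34, "70B": 70}


def bits_per_weight(model: str, quant: str) -> float:
    """Effective bits stored per parameter, from file size and parameter count."""
    return SIZES_GB[model][quant] * 8 / PARAMS_B[model]


for quant in ["Q4_K_M", "Q5_K_M", "Q8_0", "F16"]:
    print(f"7B {quant}: {bits_per_weight('7B', quant):.2f} bits/weight")
```

For the 7B row, Q4_K_M works out to roughly 4.7 bits per weight and Q8_0 to almost 9; the overhead comes from per-block scales and from tensors kept at higher precision.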
Quality Impact on Benchmarks
On standard benchmarks (MMLU, HellaSwag, TruthfulQA), Q4_K_M loses 1–3% relative to F16 for 7B/13B models and 0.5–1.5% for 70B models. Larger models are more resistant to quantization: a Q4_K_M Llama 70B often outperforms an F16 Llama 13B while needing about 40 GB versus 26 GB, far less than the 140 GB the 70B model takes at F16.
Q8_0 is effectively lossless on most benchmarks: scores stay within 0.1–0.5% of F16 at roughly half the memory.
GGUF vs GPTQ vs AWQ
| Format | Ecosystem | GPU support | CPU support |
|---|---|---|---|
| GGUF | llama.cpp, Ollama, LM Studio | Yes (partial offload) | Yes |
| GPTQ | AutoGPTQ, vLLM | GPU only | No |
| AWQ | vLLM, AutoAWQ | GPU only | No |
Of the three, only GGUF supports CPU inference and partial GPU offloading: a 70B model at Q4_K_M (about 40 GB) can run with some layers on a 16 GB GPU and the rest in system RAM, at reduced speed.
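How many layers to offload can be estimated with simple division. The sketch below assumes all transformer layers are the same size and reserves a fixed amount of VRAM for the KV cache and scratch buffers; the 2 GB reserve and the function name are assumptions, and the 80-layer count is that of Llama 70B models.

```python
def gpu_layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 2.0) -> int:
    """Estimate how many equally sized layers fit in VRAM after a reserve."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb // per_layer_gb))


# 70B at Q4_K_M (about 40 GB, 80 layers) on a 16 GB GPU.
print(gpu_layers_that_fit(40, 80, 16))  # → 28
```

The result is a starting value for llama.cpp's `--n-gpu-layers` (`-ngl`) option or the `n_gpu_layers` argument in llama-cpp-python; in practice it is worth lowering it if the context length is large, since the KV cache grows with context.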
How to Pick the Right Quantization
- Consumer GPU with 8 GB VRAM (RTX 3070, RTX 4060): Q4_K_M for 7B or 8B models
- 16 GB VRAM (RTX A4000, RTX 4080): Q4_K_M for 13B, or Q5_K_M for 7B if quality matters
- 24 GB VRAM (RTX 3090, A5000): Q4_K_M for 34B, or Q8_0 for 13B
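- Apple M3 Pro/Max with 36 GB unified memory: Q4_K_M or Q5_K_M for 34B (per the memory table, Q4_K_M for 70B needs 40 GB and Q8_0 for 34B needs 36 GB, so neither fits)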
- 2x A100 80 GB: F16 70B
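The same selection logic can be written as a lookup against the memory table: walk the quantization levels from highest to lowest quality and take the first one that fits. A minimal sketch; the 1.5 GB headroom for context and runtime buffers is an assumption, as are the names used.

```python
# Weight sizes in GB, copied from the memory table earlier in this article.
SIZES_GB = {
    "7B": {"Q4_K_M": 4.1, "Q5_K_M": 5.0, "Q8_0": 7.7, "F16": 14},
    "13B": {"Q4_K_M": 7.4, "Q5_K_M": 9.0, "Q8_0": 14, "F16": 26},
    "34B": {"Q4_K_M": 19, "Q5_K_M": 23, "Q8_0": 36, "F16": 68},
    "70B": {"Q4_K_M": 40, "Q5_K_M": 48, "Q8_0": 75, "F16": 140},
}
QUALITY_ORDER = ["F16", "Q8_0", "Q5_K_M", "Q4_K_M"]  # best quality first


def best_quant(model: str, budget_gb: float, headroom_gb: float = 1.5):
    """Return the highest-quality level that fits the budget, or None."""
    for quant in QUALITY_ORDER:
        if SIZES_GB[model][quant] + headroom_gb <= budget_gb:
            return quant
    return None


print(best_quant("13B", 24))   # → Q8_0
print(best_quant("34B", 24))   # → Q4_K_M
print(best_quant("70B", 160))  # → F16
```

A result of `None` means that even Q4_K_M does not fit, in which case the options are a smaller model or partial offload to CPU RAM.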
Find GGUF models on Hugging Face: filter by model family and look for publishers such as Bartowski or TheBloke.