Quantization reduces the numerical precision of model weights from 32-bit floating point (FP32) down to 16-bit (FP16), 8-bit (INT8), or 4-bit (INT4). The result is a smaller, faster, cheaper model. The tradeoff is some quality degradation, which is minimal at FP16 and INT8 and becomes measurable at 4 bits. For most production use cases, Q4_K_M quantization (a 4-bit format that keeps the most sensitive tensors at higher precision) stays close to full quality while using roughly 3.5x less memory than FP16 (about 7x less than FP32), enabling models to run on less expensive hardware.
What Quantization Actually Does
When a model is trained, its weights are represented as 32-bit floating point numbers. Each weight is a number like 0.37489234. In FP32, that number occupies 4 bytes. A 7-billion-parameter model has 7 billion such weights: 7B × 4 bytes = 28 GB just to store the weights.
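To run this model for inference, those 28 GB must fit in GPU memory, with extra room for activations and the KV cache. That exceeds what a 24 GB consumer card offers, so in practice it means a data-center GPU such as an A100 80GB ($2-3/hour on cloud).
INT4 quantization maps each weight to the nearest of 16 levels in a 4-bit integer range (0 to 15), with a stored scale factor to map the integers back to real values. Now each weight occupies 0.5 bytes instead of 4. The same 7B model becomes: 7B × 0.5 bytes = 3.5 GB, plus a small overhead for the scale factors. That fits within 6 GB of VRAM, so a consumer card such as the RTX 3060 (12 GB on the desktop version, $300-400) runs it comfortably.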
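The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a single scale and offset shared by the whole list; real formats store a scale per small block of weights, and the function names here are hypothetical.

```python
def quantize_4bit(weights):
    """Map floats to integer codes in 0..15 using one shared scale and offset."""
    lo, hi = min(weights), max(weights)
    # 15 steps span the range from lo to hi; fall back to 1.0 if all weights are equal.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [0.37489234, -0.81, 0.05, 0.92, -0.33]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)

# Rounding to the nearest level means no weight moves by more than half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("step size:", scale)
print("largest rounding error:", max_error)
```

The rounding error is bounded by half the step size, which is why a narrow weight range (or a per-block scale) loses less information than one wide range shared by the whole model.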
Quantization Levels and Quality Tradeoffs
| Format | Bits per weight | Memory (7B model) | Quality loss |
|--------|-----------------|-------------------|--------------|
| FP32 | 32 | 28 GB | None (baseline) |
| BF16/FP16 | 16 | 14 GB | Negligible (<0.1%) |
| INT8 | 8 | 7 GB | Very small (0.5-1%) |
| Q5_K_M | ~5.5 | 4.8 GB | Small (1-2%) |
| Q4_K_M | ~4.5 | 3.9 GB | Moderate (2-4%) |
| Q3_K_M | ~3.5 | 3.1 GB | Noticeable (5-8%) |
| Q2_K | ~2.6 | 2.3 GB | Significant (10-15%) |
Quality loss percentages are approximate and task-dependent. Complex reasoning tasks degrade more with quantization than simple classification tasks.
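The memory column follows directly from bits per weight. A small sketch, assuming 1 GB = 10^9 bytes and counting weights only (the KV cache and activations need additional memory); the function name is illustrative.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weights-only footprint in GB, with 1 GB taken as 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Approximate bits per weight for each format in the table above.
FORMATS = {
    "FP32": 32,
    "BF16/FP16": 16,
    "INT8": 8,
    "Q5_K_M": 5.5,
    "Q4_K_M": 4.5,
    "Q3_K_M": 3.5,
    "Q2_K": 2.6,
}

for name, bits in FORMATS.items():
    print(f"{name:10s} {weight_memory_gb(7e9, bits):5.1f} GB")
```

Changing `7e9` to another parameter count gives the same estimate for a different model size.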
The K-Quant Formats Explained
The Q4_K_M and Q5_K_M nomenclature comes from llama.cpp, one of the most widely used open source inference engines. The "K" variants use "k-quants": weights are quantized in small blocks that each carry their own scale, and different tensors in the model can be stored at different precisions. Tensors that are more sensitive to quantization error, such as embedding and output layers, are typically kept at higher precision, while the middle layers are quantized more aggressively.
The suffix (S, M, L) indicates how much of the model is kept at that higher precision, which trades file size against quality:
- S (small): smallest file, most quality loss
- M (medium): balanced — this is the recommended default
- L (large): largest file, least quality loss
Q4_K_M is the standard recommendation for most use cases. It is the sweet spot: roughly 3.5x smaller than FP16, with quality loss that is small on most tasks.
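Mixed precision is why these formats report fractional bit widths such as ~4.5. The sketch below averages bits per weight over tensor groups; the group names, sizes, and bit widths are hypothetical and chosen only for illustration, not taken from llama.cpp's actual recipe.

```python
# Hypothetical tensor groups: (name, parameter count, bits per weight).
# These numbers are illustrative assumptions, not llama.cpp's real layout.
TENSOR_GROUPS = [
    ("token_embeddings", 0.5e9, 6.0),
    ("middle_layers", 6.0e9, 4.0),
    ("output_head", 0.5e9, 6.0),
]


def effective_bits_per_weight(groups):
    """Parameter-weighted average bit width across all tensor groups."""
    total_params = sum(count for _, count, _ in groups)
    total_bits = sum(count * bits for _, count, bits in groups)
    return total_bits / total_params


average_bits = effective_bits_per_weight(TENSOR_GROUPS)
print(f"effective bits per weight: {average_bits:.2f}")
```

Keeping a larger share of the parameters at the higher bit width raises the average, which is the direction the S, M, and L suffixes move in.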
Formats for Different Inference Engines
Different inference tools use different quantization formats:
llama.cpp / Ollama: GGUF format (.gguf files). The most hardware-flexible format — runs on CPU, Apple Silicon, and CUDA GPUs. Use Q4_K_M or Q5_K_M from Hugging Face for most cases.
# Download a GGUF model with Ollama (handles this automatically)
ollama pull llama3.1:8b # Downloads Q4_K_M by default
ollama pull llama3.1:8b-instruct-q8_0 # Explicit Q8 version
vLLM: Supports AWQ and GPTQ formats for GPU inference. AWQ (Activation-aware Weight Quantization) from Lin et al. 2023 produces better quality than naive INT4 by accounting for activation magnitudes during quantization.
pip install vllm
python -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-7B-AWQ --quantization awq
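Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The following is a minimal standard-library sketch, assuming the server listens on vLLM's default address `http://localhost:8000`; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is on its default host and port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "TheBloke/Llama-2-7B-AWQ"


def build_completion_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-style /completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server up, `print(complete("Quantization is"))` prints the model's continuation. The model name in the request has to match the one passed to `--model`.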
Transformers (Hugging Face): BitsAndBytes library for 4-bit and 8-bit quantization directly from Python:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit loading config: NF4 weights, FP16 compute, double-quantized scales
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NF4 generally preserves quality better than plain INT4
)

# device_map="auto" places the layers on the available GPU(s) automatically
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    quantization_config=quantization_config,
    device_map="auto",
)
Quantization and Inference Speed
Beyond memory reduction, quantization also speeds up inference. Token generation is usually limited by how fast the weights can be read from memory, so smaller weights mean less data to move per token. On modern GPUs with INT4 support, quantized models can run 1.5-2x faster than FP16 models of the same architecture. This compounds the hardware advantage: a quantized model fits on cheaper hardware and runs faster on it.
On Apple Silicon (M1/M2/M3/M4), the unified memory architecture lets the GPU use most of the system RAM, so quantized models run efficiently without a discrete GPU. Many users find that Q4_K_M on a MacBook M2 with 16-24 GB memory gives comparable performance to a small cloud GPU server.
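One common way to reason about the speedup is a bandwidth ceiling: if generating each token requires reading every weight once, throughput cannot exceed memory bandwidth divided by the size of the weights. The sketch below applies that rule of thumb; the bandwidth figure is a hypothetical placeholder, and real speedups (such as the 1.5-2x quoted above) fall below this ceiling because of dequantization work and other overheads.

```python
def max_decode_tokens_per_second(bandwidth_gb_s, num_params, bits_per_weight):
    """Upper bound on single-stream decoding speed when weight reads dominate."""
    weight_gb = num_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / weight_gb


# Hypothetical memory bandwidth in GB/s; substitute the figure for your hardware.
BANDWIDTH_GB_S = 360.0

fp16_ceiling = max_decode_tokens_per_second(BANDWIDTH_GB_S, 7e9, 16)
q4_ceiling = max_decode_tokens_per_second(BANDWIDTH_GB_S, 7e9, 4.5)

print(f"FP16 ceiling:   {fp16_ceiling:.1f} tokens/s")
print(f"Q4_K_M ceiling: {q4_ceiling:.1f} tokens/s")
print(f"ceiling ratio:  {q4_ceiling / fp16_ceiling:.2f}x")
```

The ratio depends only on bits per weight, not on the bandwidth value, so it shows the most that quantization alone could gain on any memory-bound device.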
When Not to Quantize
Avoid INT4 quantization for tasks where precision matters:
- Mathematical computation where small errors propagate (chain-of-thought math, numerical analysis)
- Tasks requiring careful reasoning over very long contexts (10k+ tokens)
- Fine-tuning: always fine-tune in FP16 or BF16, then quantize for inference
For these tasks, use INT8 or FP16 and accept the higher memory requirement.
Benchmarks: Q4_K_M vs. FP16 on Real Tasks
Testing Llama 3.1 70B in Q4_K_M vs. FP16 on MMLU:
- FP16: 81.3% accuracy
- Q4_K_M: 79.8% accuracy
- Quality delta: -1.5 percentage points
Testing on HumanEval (code generation):
- FP16: 73.2% pass rate
- Q4_K_M: 71.1% pass rate
- Quality delta: -2.1 percentage points
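For a quality tradeoff of 1.5-2 percentage points, you get roughly 3.5x less weight memory and the ability to run a 70B model on two 24GB GPUs (about 39 GB of weights) instead of the six or more that FP16 (about 140 GB of weights) would need. For most applications, this tradeoff is highly favorable.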
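A weights-only estimate of how many cards a model needs can be sketched as follows. It assumes ~4.5 bits per weight for Q4_K_M and ignores the KV cache and activations, so treat the results as lower bounds; the function name is illustrative.

```python
import math


def min_gpus_for_weights(num_params, bits_per_weight, gpu_memory_gb):
    """Smallest number of GPUs whose combined memory holds the weights alone."""
    weight_gb = num_params * bits_per_weight / 8 / 1e9
    return math.ceil(weight_gb / gpu_memory_gb)


q4_gpus = min_gpus_for_weights(70e9, 4.5, 24)   # Q4_K_M, assuming ~4.5 bits per weight
fp16_gpus = min_gpus_for_weights(70e9, 16, 24)  # FP16

print("24 GB GPUs for 70B at Q4_K_M:", q4_gpus)
print("24 GB GPUs for 70B at FP16:  ", fp16_gpus)
```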