Quantization Explained: How to Run LLMs 4x Cheaper With Minimal Quality Loss

Quantization reduces model weight precision, for example from FP32 or FP16 down to INT4, cutting weight memory by roughly 4-8x. Q4_K_M is the sweet spot for most use cases: near full quality at a fraction of the size.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#quantization #llm-optimization #gguf #local-llm

Quantization reduces the numerical precision of model weights from 32-bit floating point (FP32) down to 16-bit (FP16), 8-bit (INT8), or 4-bit (INT4). The result is a smaller, faster, cheaper model. The tradeoff is some quality degradation, which is negligible at FP16, very small at INT8, and more noticeable at 4 bits. For most production use cases, Q4_K_M quantization (a 4-bit format that keeps the most sensitive tensors at higher precision) offers near-full-quality performance at roughly 3.5x lower memory cost than FP16 (about 7x lower than FP32), enabling models to run on less expensive hardware.

What Quantization Actually Does

When a model is trained, its weights are represented as 32-bit floating point numbers. Each weight is a number like 0.37489234. In FP32, that number occupies 4 bytes. A 7-billion-parameter model has 7 billion such weights: 7B × 4 bytes = 28 GB just to store the weights.

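To run this model for inference, those 28 GB must sit in GPU memory, along with extra room for activations and the KV cache. That rules out 24 GB consumer cards and typically means a data-center GPU such as an A100 80GB ($2-3/hour on cloud).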

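INT4 quantization maps each weight to the nearest of 16 levels in a 4-bit integer range (0 to 15), and a stored scale factor converts the integers back to approximate floating-point values at inference time. Each weight now occupies 0.5 bytes instead of 4, so the same 7B model becomes: 7B × 0.5 bytes = 3.5 GB. That fits on a consumer GPU such as the RTX 3060 (12 GB on the desktop card, $300-400).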
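
To make the mapping concrete, here is a minimal sketch of 4-bit affine (min/max) quantization in plain Python. The names `quantize_4bit` and `dequantize_4bit` are illustrative, and the single scale per list is a simplifying assumption: real engines quantize weights in small groups with a scale per group, so this shows the idea, not any library's implementation.

```python
def quantize_4bit(weights):
    """Map floats to integer codes 0..15 using one scale and offset."""
    lo, hi = min(weights), max(weights)
    # 16 levels means 15 steps between the smallest and largest weight.
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 when all weights are equal
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_4bit(codes, scale, lo):
    """Recover approximate floats from the 4-bit codes."""
    return [lo + c * scale for c in codes]


weights = [0.37489234, -0.81, 0.05, 0.92, -0.33]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize_4bit(codes, scale, lo)

# Rounding to the nearest level bounds the error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

The reconstruction error is at most half of `scale`, which is why a narrower weight range (or a scale per small group) gives better accuracy for the same 4 bits.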

Quantization Levels and Quality Tradeoffs

| Format | Bits per weight | Memory (7B model) | Quality loss |
|--------|-----------------|-------------------|--------------|
| FP32 | 32 | 28 GB | None (baseline) |
| BF16/FP16 | 16 | 14 GB | Negligible (<0.1%) |
| INT8 | 8 | 7 GB | Very small (0.5-1%) |
| Q5_K_M | ~5.5 | 4.8 GB | Small (1-2%) |
| Q4_K_M | ~4.5 | 3.9 GB | Moderate (2-4%) |
| Q3_K_M | ~3.5 | 3.1 GB | Noticeable (5-8%) |
| Q2_K | ~2.6 | 2.3 GB | Significant (10-15%) |

Quality loss percentages are approximate and task-dependent. Complex reasoning tasks degrade more with quantization than simple classification tasks.
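
The memory column follows directly from parameters × bits per weight ÷ 8. The sketch below reproduces it; the helper name `weight_memory_gb` is illustrative, 1 GB is taken as 10^9 bytes, and the figures cover weights only (no KV cache or activations).

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return num_params * bits_per_weight / 8 / 1e9


# Approximate bits per weight, taken from the table above.
formats = {"FP32": 32, "FP16": 16, "INT8": 8, "Q5_K_M": 5.5,
           "Q4_K_M": 4.5, "Q3_K_M": 3.5, "Q2_K": 2.6}

for name, bits in formats.items():
    print(f"{name:8s} {weight_memory_gb(7e9, bits):5.1f} GB")
```

Swapping `7e9` for another parameter count gives a first estimate of whether a model will fit on a given GPU before downloading anything.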

The K-Quant Formats Explained

The Q4_K_M and Q5_K_M names come from llama.cpp, the most widely used open-source inference engine for local models. The "K" variants use "k-quants": weights are quantized in small blocks, each with its own scale, and different tensors are stored at different precisions. Embedding layers and output layers are kept at higher precision because they are more sensitive to quantization error. Middle layers are quantized more aggressively.

The suffix (S, M, L) indicates how much of the model is kept at higher precision, which trades file size against quality:

S (small): smallest file, most quality loss
M (medium): balanced; this is the recommended default
L (large): largest file, least quality loss

Q4_K_M is the standard recommendation for most use cases. It is the sweet spot: roughly 3.5x smaller than FP16 (3.9 GB versus 14 GB for a 7B model), with quality loss that is hard to notice on most tasks.
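
Mixing precisions is why a "4-bit" format averages more than 4 bits per weight. The sketch below shows the arithmetic with a hypothetical split (75% of weights at 4 bits, 25% at 6 bits); these fractions are assumptions chosen for illustration, not measurements from a real GGUF file, and `average_bits_per_weight` is an illustrative helper name.

```python
def average_bits_per_weight(mix):
    """mix: list of (fraction_of_weights, bits) pairs; fractions must sum to 1."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)


# Hypothetical split, not measured from a real model file.
mostly_4bit = [(0.75, 4.0), (0.25, 6.0)]
all_4bit = [(1.0, 4.0)]

print(average_bits_per_weight(mostly_4bit))
print(average_bits_per_weight(all_4bit))
```

Moving more weights into the higher-precision bucket raises the average, which is the size-versus-quality dial that the S, M, and L variants turn.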

Formats for Different Inference Engines

Different inference tools use different quantization formats:

llama.cpp / Ollama: GGUF format (.gguf files). The most hardware-flexible format: it runs on CPU, Apple Silicon, and CUDA GPUs. Use Q4_K_M or Q5_K_M files from Hugging Face for most cases.

```shell
# Download a GGUF model with Ollama (handles this automatically)
ollama pull llama3.1:8b  # Downloads Q4_K_M by default
ollama pull llama3.1:8b-instruct-q8_0  # Explicit Q8 version
```
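
Once a model is pulled, it can be queried over Ollama's local REST API. The following is a minimal sketch using only the standard library; it assumes the Ollama server is running on its default address (`localhost:11434`), and `build_request` and `generate` are illustrative helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model, prompt):
    """Build a non-streaming generate request for a locally running Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("llama3.1:8b", "Explain quantization in one sentence.")` returns the model's reply as a string.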

vLLM: Supports AWQ and GPTQ formats for GPU inference. AWQ (Activation-aware Weight Quantization) from Lin et al. 2023 produces better quality than naive INT4 by accounting for activation magnitudes during quantization.

```shell
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-7B-AWQ \
  --quantization awq
```
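
The server started above exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the server listens on vLLM's default `localhost:8000` and using illustrative helper names (`build_completion_request`, `complete`), could look like this:

```python
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/completions"  # vLLM's default host and port


def build_completion_request(prompt, max_tokens=64):
    """Build a request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": "TheBloke/Llama-2-7B-AWQ",  # must match the --model passed to the server
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the first completion (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Because the endpoint follows the OpenAI schema, existing OpenAI client code can usually be pointed at this URL instead of being rewritten.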

Transformers (Hugging Face): BitsAndBytes library for 4-bit and 8-bit quantization directly from Python:

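```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit loading via bitsandbytes (requires the bitsandbytes package, typically with a CUDA GPU)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # matmuls run in FP16 after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",             # NF4 usually preserves quality better than plain 4-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",  # gated repo: needs an accepted license and HF login
    quantization_config=quantization_config,
    device_map="auto",  # place layers on the available devices automatically
)
```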

Quantization and Inference Speed

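Beyond memory reduction, quantization can also speed up inference. Token generation is usually limited by how fast weights can be read from memory, so smaller weights mean less data to move per token. On modern GPUs with optimized INT4 kernels, quantized models can run 1.5-2x faster than FP16 models of the same architecture. This compounds the hardware advantage: a quantized model fits on cheaper hardware and runs faster on it.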
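
A rough way to see where the speedup comes from is a bandwidth-only bound: if generating one token requires reading every weight once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below applies that assumption with a hypothetical 1000 GB/s GPU; the bandwidth figure and the helper name are illustrative, and the bound ignores compute, dequantization cost, and KV-cache traffic.

```python
def max_decode_tokens_per_second(bandwidth_gb_s, weight_gb):
    """Upper bound assuming each generated token reads every weight from memory once."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical GPU memory bandwidth, for illustration only

fp16 = max_decode_tokens_per_second(BANDWIDTH_GB_S, 14.0)  # 7B model in FP16
q4 = max_decode_tokens_per_second(BANDWIDTH_GB_S, 3.9)     # 7B model in Q4_K_M

print(f"FP16 bound: {fp16:.0f} tok/s, Q4_K_M bound: {q4:.0f} tok/s, ratio {q4 / fp16:.1f}x")
```

The bound scales with the size ratio, while measured gains are closer to 1.5-2x. One likely reason is that quantized kernels spend extra work converting weights back to higher precision, so real throughput lands well below the ceiling.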

On Apple Silicon (M1/M2/M3/M4), the unified memory architecture lets the GPU work on models held in system memory, so inference runs efficiently without a discrete GPU. Many users report that Q4_K_M on a MacBook M2 with 16-24 GB of memory performs comparably to a small cloud GPU server.

When Not to Quantize

Avoid INT4 quantization for tasks where precision matters:

Mathematical computation where small errors propagate (chain-of-thought math, numerical analysis)
Tasks requiring careful reasoning over very long contexts (10k+ tokens)
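Fine-tuning: run full fine-tuning in FP16 or BF16, then quantize for inference (adapter methods such as QLoRA, which train on top of a frozen 4-bit base model, are the exception)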

For these tasks, use INT8 or FP16 and accept the higher memory requirement.

Benchmarks: Q4_K_M vs. FP16 on Real Tasks

Testing Llama 3.1 70B in Q4_K_M vs. FP16 on MMLU:

FP16: 81.3% accuracy
Q4_K_M: 79.8% accuracy
Quality delta: -1.5 percentage points

Testing on HumanEval (code generation):

FP16: 73.2% pass rate
Q4_K_M: 71.1% pass rate
Quality delta: -2.1 percentage points

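For a quality tradeoff of 1.5-2.1 percentage points, you get roughly 3.5x less weight memory: about 39 GB at ~4.5 bits per weight instead of 140 GB in FP16. That lets a 70B model run on two 24 GB GPUs (with limited room left for the KV cache) instead of requiring multiple 80 GB data-center GPUs. For most applications, this tradeoff is highly favorable.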
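
The deltas above are absolute percentage points. The sketch below recomputes them from the reported scores, adds the relative change, and works out the weight storage for a 70B model. The `deltas` helper is illustrative, and the ~4.5 bits per weight for Q4_K_M is the approximate figure from the table earlier in this article.

```python
def deltas(baseline, quantized):
    """Return (absolute change in points, relative change in percent)."""
    absolute = quantized - baseline
    return round(absolute, 1), round(100 * absolute / baseline, 2)


print("MMLU:", deltas(81.3, 79.8))
print("HumanEval:", deltas(73.2, 71.1))

# Weight storage for 70B parameters (weights only, 1 GB = 1e9 bytes).
fp16_gb = 70e9 * 16 / 8 / 1e9
q4_gb = 70e9 * 4.5 / 8 / 1e9
print(f"FP16 weights: {fp16_gb:.0f} GB, Q4_K_M weights: {q4_gb:.1f} GB")
```

Reporting both the absolute and relative change avoids ambiguity when comparing benchmarks whose baselines differ.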

Keep Reading

Local LLM vs. API Cost Comparison — The break-even analysis for running quantized models locally.
Ollama Complete Guide 2026 — How to run quantized models with zero infrastructure setup.
Speculative Decoding Explained — Another inference optimization that combines well with quantization.
