Running Open Source LLMs in Production: What It Actually Takes

vLLM, Ollama, and TGI are the main serving options. Here are the hardware requirements, a latency comparison, and the cost crossover point where self-hosting beats the API.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

9 min read

// tags

#vllm #ollama #llm-inference #self-hosted-ai


Running open source LLMs in production is technically feasible and cost-competitive with commercial APIs for mid-to-high token volumes, but it requires infrastructure management that adds engineering overhead. The three main serving options are vLLM (best throughput performance, requires CUDA), Ollama (easiest setup, moderate throughput), and Hugging Face TGI (production-grade, Hugging Face's official server). For a team processing 10M+ tokens/month, self-hosting a 7-8B parameter model on a single GPU instance typically costs $200-500/month versus $500-3,000/month for equivalent commercial API usage, depending on the model. Below 10M tokens/month, API pricing is usually more cost-effective once you account for engineering time.

I have run Llama and Mistral models in production for Zlyqor's AI features. Here is the complete picture.

Server Options

vLLM

vLLM (github.com/vllm-project/vllm, 38k+ stars) is the highest-throughput open source LLM inference server. It uses PagedAttention, an attention algorithm that stores the KV cache in fixed-size blocks, much like virtual memory pages, which removes most of the memory wasted by KV cache fragmentation. The reclaimed memory allows larger batches, which is where the 2-24x throughput gain over naive implementations comes from.

vLLM provides an OpenAI-compatible REST API, so dropping it in as a replacement for OpenAI's API requires only changing the base URL.
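As a rough illustration of the base-URL swap, the sketch below builds an OpenAI-style chat completion request against a local vLLM server using only the Python standard library. It assumes the server is listening on port 8000 (the port used in the installation command in this section); the helper name `build_chat_request` and the placeholder API key are hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

# Assumption: a vLLM server is listening locally on port 8000.
VLLM_BASE_URL = "http://localhost:8000/v1"


def build_chat_request(base_url, model, user_message, api_key="not-needed"):
    """Build (but do not send) an OpenAI-style chat completion request.

    Hypothetical helper: only base_url changes between a hosted
    OpenAI-compatible API and a self-hosted server.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    VLLM_BASE_URL, "mistralai/Mistral-7B-Instruct-v0.3", "Hello"
)

# To actually send it (requires the server to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```

In practice the model name usually has to change along with the base URL, since the server only knows the model it was started with.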

Installation:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --port 8000

Requirements: NVIDIA GPU with CUDA 11.8+. vLLM is built primarily for NVIDIA GPUs; AMD (ROCm) and CPU backends exist but are less mature, so treat CUDA as the production path. Minimum 16GB VRAM for Mistral 7B at FP16, 8GB with 4-bit quantization (AWQ or GPTQ).

Throughput on A10G (24GB VRAM): ~1,200 tokens/second for Mistral 7B at batch_size=32.
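To put that throughput figure in monthly terms, here is a small back-of-the-envelope sketch. It takes the ~1,200 tokens/second number above at face value and treats it as a sustained rate; the 25% utilization figure is an illustrative assumption, not a measurement, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_token_capacity(tokens_per_second, utilization=1.0, days=30):
    """Upper bound on tokens generated per month at a sustained rate."""
    return int(tokens_per_second * SECONDS_PER_DAY * days * utilization)


# ~1,200 tokens/second is the A10G figure quoted above.
full = monthly_token_capacity(1200)
# Assumption: real traffic keeps the GPU busy only 25% of the time.
quarter = monthly_token_capacity(1200, utilization=0.25)

print(f"{full:,}")     # 3,110,400,000
print(f"{quarter:,}")  # 777,600,000
```

Even at a quarter of the quoted rate, a single GPU has far more capacity than a 10M tokens/month workload uses, so at that volume the instance cost is set by keeping the server available rather than by raw throughput.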

Ollama

Ollama (github.com/ollama/ollama, 90k+ stars) is the simplest way to run LLMs locally. Install with one command, pull models like Docker images, and get an OpenAI-compatible API.

# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Pull and run a model
ollama pull llama3.2:3b
ollama run llama3.2:3b "What is the capital of France?"

# API server (runs by default at localhost:11434)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'

Ollama supports CPU inference (slow) and Apple Metal GPU (fast on Apple Silicon). It is the right choice for development and low-volume production. For high-throughput production, vLLM significantly outperforms Ollama.

Throughput on A10G: ~300 tokens/second for Llama 3.2 3B. That is substantially lower than vLLM, even on a smaller model, because Ollama batches concurrent requests less aggressively.

Hugging Face TGI (Text Generation Inference)

TGI (github.com/huggingface/text-generation-inference, 10k+ stars) is Hugging Face's production LLM server, used internally by Hugging Face for their hosted models. It is feature-rich: continuous batching, tensor parallelism, quantization, and token streaming.

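# Host directory for cached model weights, so they are not re-downloaded on every run
volume=$PWD/data

docker run --gpus all \
  -v $volume:/data \
  -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.3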

TGI has slightly lower throughput than vLLM in most benchmarks, but it handles edge cases better and is maintained directly by Hugging Face. For teams already deep in the Hugging Face ecosystem, TGI is a natural fit.

Hardware Requirements by Model Size

Model Size	Min GPU VRAM (FP16)	Min GPU VRAM (4-bit)	Recommended GPU
3B	8GB	4GB	RTX 3060, T4
7-8B	16GB	8GB	RTX 3090, A10G
13B	28GB	14GB	A100 40GB
70B	140GB	40GB	2x A100 80GB
405B	810GB	~250GB	4-8x H100

For Mistral 7B or Llama 3.1 8B, a single NVIDIA A10G (24GB VRAM) handles FP16 inference comfortably. An RTX 4090 (24GB VRAM, ~$1,600) handles the same models locally.
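The VRAM figures in the table follow from a simple rule of thumb: parameter count times bytes per parameter. The sketch below computes the memory needed for the weights alone; the function name is hypothetical, and it deliberately ignores KV cache, activations, and framework overhead, which is why the table's minimums sit at or above these numbers.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * bits_per_param / 8


# Model sizes (in billions of parameters) taken from the table above.
for size in (3, 7, 13, 70, 405):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: FP16 weights {fp16:.1f} GB, 4-bit weights {int4:.1f} GB")
```

For a 7B model this gives 14 GB at FP16, which is why a 16GB card is the practical minimum and a 24GB card leaves headroom for the KV cache.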
