Running open source LLMs in production is technically feasible and cost-competitive with commercial APIs at mid-to-high token volumes, but it requires infrastructure management that adds engineering overhead. The three main serving options are vLLM (highest throughput, best supported on NVIDIA GPUs), Ollama (easiest setup, moderate throughput), and Hugging Face TGI (production-grade, Hugging Face's official server). For a team processing 10M+ tokens/month, self-hosting a 7-8B parameter model on a single GPU instance typically costs $200-500/month versus $500-3,000/month for equivalent commercial API usage, depending on the model. Below 10M tokens/month, API pricing is usually more cost-effective once you account for engineering time.
I have run Llama and Mistral models in production for Zlyqor's AI features. Here is the complete picture.
Server Options
vLLM
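vLLM (github.com/vllm-project/vllm, 38k+ stars) is the highest-throughput open source LLM inference server. It uses PagedAttention, an attention algorithm that stores the KV cache in fixed-size blocks instead of one contiguous region per request, which removes most of the memory wasted by KV cache fragmentation. The freed memory lets the server batch more requests at once, which enables roughly 2-24x higher throughput than naive implementations, depending on the baseline and workload.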
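To make the fragmentation point concrete, here is a toy accounting model in Python. It is not vLLM's code: the function names, the 2,048-token reservation, and the block size of 16 tokens are illustrative assumptions. It only compares how many KV cache slots get reserved when each request is given a contiguous region sized for the maximum length versus blocks allocated on demand.

```python
import math

def reserved_slots_contiguous(seq_lens, max_len):
    """Slots reserved when every request gets a region sized for max_len."""
    return len(seq_lens) * max_len

def reserved_slots_paged(seq_lens, block_size):
    """Slots reserved when each request gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

def waste_fraction(reserved, seq_lens):
    """Share of reserved slots that hold no tokens."""
    used = sum(seq_lens)
    return (reserved - used) / reserved

# Three in-flight requests with very different lengths (made-up values).
seq_lens = [100, 500, 30]
contiguous = reserved_slots_contiguous(seq_lens, max_len=2048)
paged = reserved_slots_paged(seq_lens, block_size=16)

print(f"contiguous waste: {waste_fraction(contiguous, seq_lens):.1%}")
print(f"paged waste:      {waste_fraction(paged, seq_lens):.1%}")
```

In the paged scheme the only waste is the unfilled tail of each request's last block, which is why more requests fit in the same VRAM.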
vLLM provides an OpenAI-compatible REST API, so using it as a drop-in replacement for OpenAI's API usually requires only changing the base URL and the model name.
Installation:
pip install vllm
python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3 --port 8000
Requirements: an NVIDIA GPU with CUDA 11.8+ is the primary and best-supported target. vLLM also has ROCm (AMD) and CPU backends, but they are less mature than the CUDA path. Minimum 16GB VRAM for Mistral 7B at FP16 (half precision), 8GB with 4-bit quantization (AWQ or GPTQ).
Throughput on A10G (24GB VRAM): ~1,200 tokens/second for Mistral 7B at batch_size=32.
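Because the server started above speaks the OpenAI chat completions protocol, a client needs nothing beyond the standard library. The following is a minimal sketch, assuming the server is reachable at `http://localhost:8000/v1`; the helper names `build_chat_request` and `chat` are mine, not part of vLLM.

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, max_tokens=200):
    """Build a POST request for an OpenAI-compatible /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(base_url, model, prompt):
    """Send one prompt and return the assistant's reply text."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("http://localhost:8000/v1", "mistralai/Mistral-7B-Instruct-v0.3", "Hello")` returns the generated text. Pointing the same code at another OpenAI-compatible server is a matter of changing the two string arguments.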
Ollama
Ollama (github.com/ollama/ollama, 90k+ stars) is the simplest way to run LLMs locally. Install with one command, pull models like Docker images, and get an OpenAI-compatible API.
# Install
curl -fsSL https://ollama.ai/install.sh | sh
# Pull and run a model
ollama pull llama3.2:3b
ollama run llama3.2:3b "What is the capital of France?"
# API server (runs by default at localhost:11434)
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}'
Ollama supports CPU inference (slow) and Apple Metal GPU (fast on Apple Silicon). It is the right choice for development and low-volume production. For high-throughput production, vLLM significantly outperforms Ollama.
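Throughput on A10G: ~300 tokens/second for Llama 3.2 3B, substantially lower than vLLM because Ollama batches concurrent requests far less aggressively. Note that this figure is for a smaller model than the vLLM number above, so the two are not a like-for-like comparison.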
Hugging Face TGI (Text Generation Inference)
TGI (github.com/huggingface/text-generation-inference, 10k+ stars) is Hugging Face's production LLM server, and Hugging Face uses it internally to serve its hosted models. It is feature-rich, with continuous batching, tensor parallelism, quantization, and token streaming.
volume=$PWD/data  # host directory used to cache downloaded model weights
docker run --gpus all --shm-size 1g -v $volume:/data -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest --model-id mistralai/Mistral-7B-Instruct-v0.3
TGI has slightly lower throughput than vLLM in most benchmarks, but it is maintained directly by Hugging Face and tends to handle edge cases better. For teams already invested in the Hugging Face ecosystem, TGI is a natural fit.
Hardware Requirements by Model Size
| Model Size | Min GPU VRAM (FP16) | Min GPU VRAM (4-bit) | Recommended GPU |
|------------|---------------------|----------------------|-----------------|
| 3B | 8GB | 4GB | RTX 3060, T4 |
| 7-8B | 16GB | 8GB | RTX 3090, A10G |
| 13B | 28GB | 14GB | A100 40GB |
| 70B | 140GB | 40GB | 2x A100 80GB |
| 405B | 810GB | ~250GB | 4-8x H100 |
For Mistral 7B or Llama 3.1 8B, a single NVIDIA A10G (24GB VRAM) handles FP16 inference comfortably. An RTX 4090 (24GB VRAM) at ~$1,600 handles the same models locally.
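The VRAM figures follow largely from parameter count times bytes per parameter. The sketch below estimates the memory for weights alone; the table's minimums sit at or above these values because the KV cache and activations also need room. The function names and the 20% headroom default are my assumptions, not measured values.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_param / 8

def fits_on_gpu(params_billions, bits_per_param, vram_gb, headroom=0.2):
    """True if the weights fit while leaving `headroom` of VRAM free
    for the KV cache and activations (the 20% default is a guess)."""
    return weight_memory_gb(params_billions, bits_per_param) <= vram_gb * (1 - headroom)

for size in (3, 7, 13, 70, 405):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: {fp16:.1f} GB at FP16, {int4:.1f} GB at 4-bit")
```

For example, `fits_on_gpu(7, 16, 24)` checks a 7B model at FP16 against a 24GB card such as the A10G. Longer contexts and larger batches need more headroom than the default assumes.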
Latency vs Commercial API
Measured latencies for a 200-token generation at low concurrency:
- GPT-4o mini: 400-800ms (varies significantly)
- Claude Haiku: 300-600ms
- Groq Llama 3.1 8B: 100-200ms (their LPU is exceptional)
- vLLM Mistral 7B on A10G: 200-400ms
- Ollama Mistral 7B on A10G: 300-600ms
At high concurrency (50+ simultaneous requests), vLLM's batching advantages become significant: it maintains near-linear throughput scaling while naive servers degrade rapidly.
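To check how a server behaves under concurrency on your own hardware, you can time a fixed set of requests at different parallelism levels. This is a minimal sketch: `generate` is a hypothetical callable you supply that sends one prompt to your server and returns the number of tokens generated, and `measure_throughput` is my own helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure_throughput(generate, prompts, concurrency):
    """Run `generate` over `prompts` with `concurrency` worker threads.

    `generate(prompt)` must return the number of tokens it produced.
    Returns (tokens_per_second, total_tokens) across the whole run.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        token_counts = list(pool.map(generate, prompts))
    elapsed = time.perf_counter() - start
    total_tokens = sum(token_counts)
    return total_tokens / elapsed, total_tokens
```

Running it at concurrency 1, 8, 32, and 64 against the same prompts shows whether aggregate tokens/second keeps climbing or flattens out. Threads are enough here because the workers spend their time waiting on the network.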
Cost Calculation: API vs Self-Hosting
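Commercial API costs for 10M tokens/month (GPT-4o mini pricing at $0.15/1M input + $0.60/1M output, assuming a 60/40 input/output split): 6M input tokens cost $0.90 and 4M output tokens cost $2.40, so approximately $3.30/month. At these rates, a bill of roughly $330/month corresponds to about 1B tokens/month.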
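The API side of the comparison is a weighted sum of input and output prices. A small sketch for rerunning it with your own volume and split (the function name is mine, and prices are per 1M tokens):

```python
def api_monthly_cost(total_tokens, input_price_per_m, output_price_per_m, input_share):
    """Monthly API bill in dollars.

    total_tokens: tokens processed per month (input plus output)
    input_share:  fraction of those tokens that are input, e.g. 0.6
    Prices are dollars per 1M tokens.
    """
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# The scenario from the text: 10M tokens/month, GPT-4o mini prices, 60/40 split.
cost = api_monthly_cost(10_000_000, 0.15, 0.60, 0.6)
print(f"${cost:.2f}/month")
```

Output tokens dominate the bill at these prices, so the input/output split matters as much as total volume.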
Self-hosted vLLM on AWS g5.2xlarge (A10G, ~$1.20/hour on-demand): $864/month for always-on deployment. Use spot instances ($0.40/hour) for batch workloads: $288/month.
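The monthly figures above are the hourly rate multiplied by a 720-hour month (24 hours x 30 days, which is what $864 at $1.20/hour implies). A sketch that also lets you model an instance that runs only part of the time; the `utilization` parameter is my addition:

```python
HOURS_PER_MONTH = 24 * 30  # 720, the month length implied by the figures above

def monthly_instance_cost(hourly_rate, utilization=1.0):
    """Monthly cost in dollars for an instance billed by the hour.

    utilization is the fraction of the month the instance is running
    (1.0 = always on).
    """
    return hourly_rate * HOURS_PER_MONTH * utilization

print(f"on-demand, always on: ${monthly_instance_cost(1.20):.2f}")  # $864.00
print(f"spot, always on:      ${monthly_instance_cost(0.40):.2f}")  # $288.00
```

For batch workloads that run a few hours a day, passing a lower `utilization` shows how quickly the cloud bill falls when the instance is shut down between jobs.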
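Self-hosted on Hetzner dedicated GPU server (RTX 4090, ~$200-250/month): Fixed cost. At 10M tokens/month, that works out to ~$20-25 per 1M tokens, and the effective rate drops in proportion as volume grows. It falls under API pricing once monthly volume is high enough to spread the fixed cost across many tokens.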
The crossover point: if you are generating 10M+ tokens/month consistently, self-hosting on a dedicated GPU server or spot instances is typically cheaper than commercial APIs. Below 10M tokens/month, the engineering overhead of managing GPU infrastructure usually outweighs the cost savings.
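The crossover depends heavily on which API model you would otherwise pay for, so it is worth computing for your own prices. A minimal sketch, assuming a fixed monthly server cost and ignoring engineering time (both function names are mine):

```python
def blended_price_per_m(input_price_per_m, output_price_per_m, input_share):
    """Average API price per 1M tokens for a given input/output split."""
    return input_share * input_price_per_m + (1 - input_share) * output_price_per_m

def break_even_tokens_per_month(fixed_monthly_cost, api_price_per_m):
    """Monthly token volume at which a fixed-cost server matches the API bill."""
    if api_price_per_m <= 0:
        raise ValueError("api_price_per_m must be positive")
    return fixed_monthly_cost / api_price_per_m * 1_000_000

# Example inputs: a $250/month server versus the GPT-4o mini prices quoted above.
price = blended_price_per_m(0.15, 0.60, 0.6)
volume = break_even_tokens_per_month(250, price)
print(f"break-even at about {volume / 1_000_000:.0f}M tokens/month")
```

Above the printed volume the fixed-cost server is cheaper on paper; below it the API is. Adding an estimate of monthly engineering time to `fixed_monthly_cost` pushes the break-even higher.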