What Is TGI?
Text Generation Inference (TGI) is Hugging Face's production-grade server for deploying large language models. It is the backend that powers Hugging Face Inference Endpoints, and it is available as open-source software you can run yourself.
TGI is designed for one thing: serving LLMs at high throughput with low latency. Everything in it — from the batching algorithm to the quantization support — exists to push more tokens per second out of a given GPU.
One-Command Deployment
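docker run --gpus all --shm-size 1g -p 8080:80 -v /models:/data \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --num-shard 1 --max-input-length 4096 --max-total-tokens 8192
TGI downloads the model weights into the mounted /models directory, loads them onto the GPU, and starts serving an OpenAI-compatible API on host port 8080 (mapped to port 80 inside the container). Because Meta-Llama-3-8B-Instruct is a gated model, the container needs a Hugging Face access token that has been granted access to it, passed here through the HF_TOKEN environment variable.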
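Downloading and loading weights can take several minutes, so it helps to wait for the server before sending traffic. The following is a minimal sketch, assuming the server is reachable at http://localhost:8080 and using TGI's /health route; the helper names is_healthy and wait_until_ready, the retry count, and the delay are illustrative choices, not part of TGI.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an error status: not ready yet.
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 5.0) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


# Example usage (assumed address; uncomment once the container is starting):
# ready = wait_until_ready(lambda: is_healthy("http://localhost:8080"))
# print("ready" if ready else "gave up")
```

Passing the probe in as a callable keeps the retry loop independent of the network, so it can be exercised without a running server.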
OpenAI-Compatible API
TGI's Messages API is compatible with the OpenAI Chat Completions API, so code written against the OpenAI client usually needs only a different base_url:
from openai import OpenAI

# The key is a placeholder; the client only requires a non-empty value.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What is quantum computing?"}],
    max_tokens=500,
    stream=True,
)
for chunk in response:
    # delta.content can be None on some chunks, so fall back to "".
    print(chunk.choices[0].delta.content or "", end="", flush=True)
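For clients that cannot use the openai package, the same endpoint can be called over plain HTTP. This is a rough sketch using only the standard library, assuming the server address from the deployment command and the standard Chat Completions request and response shapes; build_chat_request and extract_text are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request


def build_chat_request(base_url, prompt, max_tokens=500):
    """Build a non-streaming POST to the chat completions route."""
    payload = {
        "model": "tgi",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body):
    """Pull the assistant message out of a chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example usage (requires a running server at the assumed address):
# req = build_chat_request("http://localhost:8080", "What is quantum computing?")
# with urllib.request.urlopen(req) as resp:
#     print(extract_text(resp.read()))
```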
Continuous Batching
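A naive server processes one request at a time, or waits for an entire static batch to finish before admitting new work. TGI's continuous batching admits new requests between decoding steps and releases finished ones immediately, filling GPU capacity that would otherwise sit idle. Under concurrent load this can raise throughput substantially, often 5-10x more tokens per second than naive sequential serving, although the gain depends on the model, the hardware, and the mix of request lengths.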
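The effect can be illustrated with a toy scheduler. This is a simplified model, not TGI's actual algorithm: it assumes every decoding step costs the same regardless of how many requests share it, ignores prompt processing, and uses made-up output lengths and a made-up batch limit.

```python
from collections import deque


def sequential_steps(lengths):
    """One request at a time: every generated token costs one step."""
    return sum(lengths)


def continuous_batching_steps(lengths, max_batch):
    """Steps needed when up to max_batch requests share each forward pass
    and a waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots before the next step.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One step yields one token per running request; drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


lengths = [4, 2, 6, 3]  # tokens each (hypothetical) request will generate
print(sequential_steps(lengths))              # 15
print(continuous_batching_steps(lengths, 2))  # 8
```

With two slots, the second request finishes after two steps and the third takes its place immediately, so the 15 tokens complete in 8 steps instead of 15.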
Tensor Parallelism
For models too large for a single GPU, TGI splits tensor computations across multiple GPUs:
--num-shard 4 # splits the model across 4 GPUs
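The idea behind sharding a layer can be sketched in a few lines of plain Python. This toy example is an assumption-laden illustration, not TGI's implementation: it splits a weight matrix by columns, lets each "GPU" multiply against its own slice, and concatenates the partial results, which matches the unsharded product.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split w column-wise into num_shards equal slices, one per device."""
    cols = len(w[0])
    assert cols % num_shards == 0, "columns must divide evenly across shards"
    width = cols // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


def sharded_matmul(x, w, num_shards):
    """Each shard computes its slice of the output; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, num_shards)]
    return [value for part in partials for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
print(matmul(x, w))             # [11.0, 14.0, 17.0, 20.0]
print(sharded_matmul(x, w, 2))  # [11.0, 14.0, 17.0, 20.0]
```

Each shard holds only a fraction of the weights, which is what lets a model that exceeds one GPU's memory fit across several.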
Quantization Support
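TGI natively supports GPTQ, AWQ, and bitsandbytes 4-bit quantization. bitsandbytes quantizes standard weights on the fly at load time (for example --quantize bitsandbytes-nf4), whereas GPTQ and AWQ need a pre-quantized model from the Hub: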
--model-id TheBloke/Mistral-7B-Instruct-v0.2-GPTQ --quantize gptq
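A back-of-the-envelope calculation shows why 4-bit weights matter. The sketch below assumes a round 7 billion parameters and counts weights only; it ignores quantization scales and zero points, activations, and the KV cache, so real memory use will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 7e9  # assumed round figure for a "7B" model
print(weight_memory_gb(params, 16))  # 14.0 (fp16)
print(weight_memory_gb(params, 4))   # 3.5  (4-bit)
```

Under these assumptions, 4-bit quantization cuts weight storage to a quarter of fp16, which can be the difference between fitting on one consumer GPU or not.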
Speculative Decoding
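For tasks where the next tokens are easy to predict (code completion, structured output), speculative decoding uses a cheap proposer, classically a small draft model, to guess several tokens that the main model then verifies in a single forward pass. In TGI, speculation is enabled with the --speculate flag, and the proposals come from Medusa heads or n-gram matching rather than a separate draft model. This can roughly double effective tokens per second for compatible workloads.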
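The propose-then-verify loop can be sketched with toy stand-ins for the two models. This is a simplified greedy variant for illustration only, not TGI's implementation: target_next and draft_next are hypothetical functions mapping a token list to the next token, and the verification loop here calls the target once per position, whereas a real server scores all proposed positions in one batched forward pass.

```python
def speculative_step(target_next, draft_next, context, k):
    """One speculate-then-verify round; returns the tokens accepted."""
    # The cheap proposer guesses k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    # The target checks each guess and stops at the first disagreement.
    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # the target's own token replaces the miss
            return accepted
        accepted.append(token)

    # All k guesses matched; the same pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def target_next(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy proposer: agrees with the target except that it never predicts 4."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 4 else nxt


print(speculative_step(target_next, draft_next, [0], 3))  # [1, 2, 3, 4]
print(speculative_step(target_next, draft_next, [2], 3))  # [3, 4]
```

When the proposer is right, one verification round yields four tokens; when it misses, the round still makes progress with the target's correction, so the output matches what the target alone would have produced.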
TGI vs vLLM
Both are production LLM servers with continuous batching. TGI integrates more tightly with the Hugging Face ecosystem, including token-based access to gated models such as Llama and Gemma. vLLM, where PagedAttention originated, has broader model architecture support (including models not hosted on Hugging Face) and a more active community around that research. For standard Hugging Face models in a production setting, TGI is the lower-friction choice.