What Are Inference Endpoints?
HuggingFace Inference Endpoints is a managed service that deploys any model from the Hub (or your private repository) as a REST API. You choose a GPU, configure scaling, and get a private HTTPS endpoint in minutes — no Kubernetes, no Docker setup, no infrastructure management.
Two Deployment Types
Dedicated endpoints — a dedicated GPU instance running only your model. Consistent latency, higher cost. Suited for production traffic.
Serverless Inference API — shared infrastructure, scales to zero, pay per request. Good for development and low-volume production.
Creating an Endpoint
from huggingface_hub import HfApi
api = HfApi()
endpoint = api.create_inference_endpoint(
"my-llama3-endpoint",
repository="meta-llama/Meta-Llama-3-8B-Instruct",
framework="pytorch",
task="text-generation",
accelerator="gpu",
vendor="aws",
region="us-east-1",
type="protected", # private, protected, or public
instance_size="x1",
instance_type="nvidia-a10g",
min_replica=0, # scale to zero when idle
max_replica=2, # scale up under load
)
endpoint.wait_until_running()
print(endpoint.url)
Calling the Endpoint
from huggingface_hub import InferenceClient
client = InferenceClient(model="https://your-endpoint-url.huggingface.cloud")
response = client.text_generation(
"Explain gradient descent in plain English.",
max_new_tokens=300,
temperature=0.7,
)
print(response)
GPU Options
Available instance types include T4 (16GB, $0.60/hr), A10G (24GB, $1.30/hr), A100 40GB ($3.00/hr), and A100 80GB ($4.50/hr). For LLMs, A10G handles 7-13B models comfortably; A100 is needed for 34B+ models.
Text Generation Inference (TGI) Backend
For LLM deployments, Inference Endpoints automatically uses TGI as the backend. TGI provides:
- Continuous batching — serves multiple requests simultaneously, dramatically improving throughput
- Quantization — GPTQ, AWQ, and bitsandbytes load in 4-bit automatically
- OpenAI-compatible API — same request/response format as the OpenAI chat completions API
Private VPC with PrivateLink
For enterprise deployments where traffic must not leave your VPC, Inference Endpoints supports AWS PrivateLink and Azure Private Link. Traffic from your VPC to the endpoint never traverses the public internet.
Custom Containers
If you need dependencies not in the standard TGI image, you can specify a custom Docker image hosted on your registry.
Inference Endpoints vs Replicate vs Modal
Inference Endpoints has the deepest integration with the HuggingFace Hub and the best support for gated models (Llama, Gemma). Replicate has a broader model catalog including non-HF models and better support for diffusion models. Modal gives more control over the execution environment and supports arbitrary Python code beyond just model inference. For standard HF model deployment in a VPC, Inference Endpoints is the most straightforward path.