What Is Replicate?
Replicate is a platform that hosts open-source ML models and exposes them as simple HTTP APIs. Instead of spinning up your own GPU server to run Llama 3 or Stable Diffusion, you call Replicate's API and pay only for the GPU seconds your prediction uses.
The catalog spans LLMs, image generation, video generation, audio transcription, and specialized research models — over 50,000 models as of 2026.
Basic API Call
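# The client reads your API token from the REPLICATE_API_TOKEN environment variable.
import replicate

# For language models, the output arrives as a sequence of text chunks.
output = replicate.run(
    "meta/meta-llama-3-8b-instruct",
    input={
        "prompt": "Explain the difference between TCP and UDP in one paragraph.",
        "max_tokens": 512,
        "temperature": 0.7,
    },
)
print("".join(output))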
The same pattern works for image generation:
output = replicate.run(
    "stability-ai/sdxl:7762fd07cf82c948538e41f4a0dc12b29a59c20bab4a6f34fc40693be6ce41c1",
    input={
        "prompt": "A futuristic city at sunset, photorealistic, 4K",
        "width": 1024,
        "height": 1024,
        "num_outputs": 1,
    },
)
print(output[0])  # URL to generated image
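Under the hood, the client library wraps Replicate's HTTP API. As a rough sketch of what a call looks like without the client, the snippet below builds (but does not send) a POST request using only Python's standard library. The helper name build_prediction_request is made up for this example, and the endpoint path and Bearer-token header reflect the public HTTP API for models addressed by owner and name as best understood here, so check the current API reference before relying on them.

```python
import json
import os
import urllib.request

API_BASE = "https://api.replicate.com/v1"


def build_prediction_request(owner, name, model_input, token):
    """Build a POST request that would create a prediction (hypothetical helper)."""
    url = f"{API_BASE}/models/{owner}/{name}/predictions"
    body = json.dumps({"input": model_input}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_prediction_request(
    "meta",
    "meta-llama-3-8b-instruct",
    {"prompt": "Explain the difference between TCP and UDP in one paragraph."},
    token=os.environ.get("REPLICATE_API_TOKEN", "<your-token>"),
)
print(request.get_method(), request.full_url)
# Sending it would be one more call: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect or log exactly what would go over the wire before spending any GPU time.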
Pricing Model
Replicate charges per second of GPU time, billed at the end of each prediction:
- Nvidia T4: $0.00011/second
- Nvidia A40 (Large): $0.000725/second
- Nvidia A100 (80GB): $0.00115/second
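A Llama 3 8B inference typically takes 3-8 seconds on an A40, which at $0.000725/second comes to roughly $0.002-$0.006 per prediction. For low-volume use cases, this is far cheaper than a dedicated GPU server, which bills for every hour it is up whether or not it is serving requests.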
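The arithmetic behind that estimate is simple enough to script. The sketch below copies the per-second rates from the list above (they may be out of date) and uses made-up hardware labels and function names; the 500 predictions per day workload is purely illustrative.

```python
# Per-second rates copied from the list above; they may be out of date.
# The dictionary keys are labels invented for this example.
RATES_PER_SECOND = {
    "t4": 0.00011,
    "a40-large": 0.000725,
    "a100-80gb": 0.00115,
}


def prediction_cost(hardware, seconds):
    """Cost in dollars of one prediction that runs for `seconds` on `hardware`."""
    return RATES_PER_SECOND[hardware] * seconds


def monthly_cost(hardware, seconds_per_prediction, predictions_per_day, days=30):
    """Cost in dollars of a steady daily workload over `days` days."""
    return prediction_cost(hardware, seconds_per_prediction) * predictions_per_day * days


low = prediction_cost("a40-large", 3)
high = prediction_cost("a40-large", 8)
print(f"per prediction: ${low:.4f} to ${high:.4f}")
print(f"500 predictions/day at 5 s each: ${monthly_cost('a40-large', 5, 500):.2f}/month")
```

Plugging in your own measured prediction times and request volume gives a quick way to see when pay-per-second pricing stops being the cheaper option.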
Deployments for Low Latency
The standard API has variable cold start times (2-30 seconds depending on model size). For production use cases that need consistent latency, Replicate Deployments keep a warm instance running:
deployment = replicate.deployments.get("my-org/my-model")
prediction = deployment.predictions.create(
    input={"prompt": "Hello"},
)
prediction.wait()  # block until the prediction reaches a terminal state
print(prediction.output)
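When latency matters, it can help to bound how long you are willing to wait rather than blocking indefinitely. The following is a minimal sketch of a polling loop with a timeout. It assumes only that the prediction object exposes a status attribute and a reload() method, as the Python client's prediction objects do; wait_with_timeout and FakePrediction are names invented for this example, and the fake stands in for a real prediction so the snippet runs offline.

```python
import time


def wait_with_timeout(prediction, timeout=60.0, poll_interval=0.5, sleep=time.sleep):
    """Poll `prediction` until it finishes or `timeout` seconds have been spent waiting.

    `prediction` is any object with a `status` attribute and a `reload()` method.
    """
    terminal = {"succeeded", "failed", "canceled"}
    waited = 0.0
    while prediction.status not in terminal:
        if waited >= timeout:
            raise TimeoutError(f"prediction still {prediction.status!r} after {waited:.1f}s")
        sleep(poll_interval)
        waited += poll_interval
        prediction.reload()  # refresh the status from the API
    return prediction


class FakePrediction:
    """Stand-in used here so the sketch runs without network access."""

    def __init__(self, polls_until_done):
        self.status = "starting"
        self._remaining = polls_until_done

    def reload(self):
        self._remaining -= 1
        self.status = "succeeded" if self._remaining <= 0 else "processing"


done = wait_with_timeout(FakePrediction(3), timeout=5, sleep=lambda seconds: None)
print(done.status)
```

With a real deployment you would pass the object returned by deployment.predictions.create(...) and keep the default sleep.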
Fine-Tuning on Replicate
Replicate supports fine-tuning SDXL and Flux with your own images through a training API:
training = replicate.trainings.create(
    model="stability-ai/sdxl",
    version="39ed52f2319...",
    input={
        "input_images": "https://example.com/your-images.zip",
        "token_string": "TOK",
        "caption_prefix": "a photo of TOK",
        "max_train_steps": 1000,
    },
    destination="my-org/my-sdxl-model",
)
The resulting model is pushed to the destination you specified and can be run via the normal prediction API. If that destination model is private, only your account can run it.
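The input_images value in the training call points at a zip archive of your training images. As a small sketch of preparing that archive locally, the helper below collects the image files from a folder into a flat zip. The function name and the list of accepted file extensions are assumptions made for illustration, and uploading the archive to a public URL is left out.

```python
import tempfile
import zipfile
from pathlib import Path

# Assumed set of image extensions; adjust to what your trainer accepts.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def zip_training_images(image_dir, archive_path):
    """Zip every image directly inside `image_dir`; return the file names added."""
    image_dir = Path(image_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(image_dir.iterdir()):
            if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
                archive.write(path, arcname=path.name)  # flat layout, no parent folders
                added.append(path.name)
    return added


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "images"
    folder.mkdir()
    for name in ("01.jpg", "02.png", "notes.txt"):
        (folder / name).write_bytes(b"placeholder")
    archive_file = Path(tmp) / "your-images.zip"
    names = zip_training_images(folder, archive_file)
    with zipfile.ZipFile(archive_file) as archive:
        archived = archive.namelist()
print(names)
```

Non-image files such as notes.txt are skipped, so stray files in the folder do not end up in the training set.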
Packaging Custom Models with Cog
Cog is Replicate's open-source tool for packaging ML models as reproducible containers. If you have a custom model not on Replicate, you wrap it with Cog and push it:
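# Install the Cog CLI (Homebrew shown here; see the Cog README for other install methods)
brew install cog
cog init
# Edit cog.yaml with your dependencies and predict.py with your inference code
cog login
cog push r8.im/your-username/your-model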
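For orientation, a predict.py typically defines a Predictor class with a setup method that loads the model once and a predict method that handles each request. The sketch below is a toy version that just echoes text; the echo logic, the parameter names, and the fallback stubs (included only so the snippet runs where Cog is not installed) are all illustrative assumptions, not part of any real model.

```python
# predict.py (toy sketch)
try:
    from cog import BasePredictor, Input
except ImportError:
    # Fallback stubs so this sketch also runs where Cog is not installed.
    class BasePredictor:
        pass

    def Input(default=None, **kwargs):
        return default


class Predictor(BasePredictor):
    def setup(self):
        """Runs once when the container starts; load model weights here."""
        self.suffix = "!"  # placeholder for a real model object

    def predict(
        self,
        prompt: str = Input(description="Text to echo back"),
        repeat: int = Input(description="How many times to repeat it", default=1, ge=1),
    ) -> str:
        """Runs once per prediction; replace the body with real inference."""
        return (prompt + self.suffix) * repeat


# Local smoke test; a real predict.py would normally end at the class definition.
predictor = Predictor()
predictor.setup()
print(predictor.predict(prompt="hello", repeat=2))
```

Keeping expensive loading in setup and per-request work in predict means the model is loaded once per container rather than once per prediction.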
Replicate vs Hugging Face Inference API
Both provide hosted model inference. The Hugging Face Inference API has better support for fine-tuned models from the Hub and offers dedicated endpoints with more configuration options. Replicate has a larger community model catalog, more image and video models, and simpler pricing. For general open-source LLMs the two are comparable; for diffusion and research models, Replicate's catalog is broader.