What Is Ollama and Why It Matters
Ollama is an open-source runtime that lets you pull, run, and serve large language models on your own hardware (macOS, Linux, or Windows) with a single install. It needs no Python environment and no manual CUDA setup to get started, and your prompts never leave your machine. For teams with privacy requirements, or developers who want low-latency responses with no network round trip and no per-token fees, it is one of the quickest paths from zero to a running model.
Installation
macOS: Download the .dmg from ollama.com or install via Homebrew:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com or the Ollama GitHub releases page. The native Windows build supports GPU acceleration, so WSL2 is an option rather than a requirement.
Pulling and Running Your First Model
Once installed, pull a model from the Ollama model library:
ollama pull llama3.1:8b
ollama run llama3.1:8b
The run command opens an interactive chat. Type /bye to exit. For a one-shot query:
ollama run llama3.1:8b "Summarize the CAP theorem in two sentences"
REST API at localhost:11434
Ollama exposes a REST API automatically when the daemon is running:
curl http://localhost:11434/api/generate -d '{"model":"llama3.1:8b","prompt":"What is PagedAttention?","stream":false}'
The API docs cover /api/chat, /api/embeddings, /api/pull, and more.
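With "stream" left at its default of true, /api/generate returns one JSON object per line, each carrying a fragment of the answer in its "response" field, with "done" set to true on the last one. The following is a minimal standard-library sketch of consuming that stream; the helper names iter_fragments and stream_generate are illustrative and not part of Ollama.

```python
import json
import urllib.request
from typing import Iterable, Iterator

# Default address used throughout this article.
OLLAMA_URL = "http://localhost:11434/api/generate"


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/generate response.

    Each non-empty line is one JSON object; the text lives in "response"
    and the final object carries "done": true.
    """
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def stream_generate(model: str, prompt: str, url: str = OLLAMA_URL) -> Iterator[str]:
    """POST a prompt and yield fragments as the server produces them."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # The response object iterates line by line, matching the framing above.
        yield from iter_fragments(response)
```

Assuming the server is running and the model has been pulled, printing each item yielded by stream_generate("llama3.1:8b", "What is PagedAttention?") with end="" shows the answer as it is generated.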
OpenAI-Compatible Endpoint
Ollama ships a /v1/ endpoint that implements a subset of the OpenAI API, including chat completions. Most libraries and tools built for OpenAI will therefore work with Ollama once you change base_url, although features outside that subset may not be available:
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.1:8b","messages":[{"role":"user","content":"Hello"}]}'
In Python, install the openai package first:
pip install openai
Then point the client at the local endpoint:
from openai import OpenAI

# The SDK requires an api_key value; Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is entropy?"}],
)
print(response.choices[0].message.content)
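Because only base_url and api_key differ between the two backends, one way to keep an app portable is to choose them at startup. This is a sketch under assumptions: LLM_BACKEND is a hypothetical environment variable invented for this example, and client_settings is an illustrative helper, not part of either SDK.

```python
import os
from typing import Mapping

LOCAL_BASE_URL = "http://localhost:11434/v1"


def client_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return keyword arguments for OpenAI(...).

    LLM_BACKEND is a hypothetical variable name chosen for this sketch.
    """
    if env.get("LLM_BACKEND", "ollama") == "ollama":
        # Ollama does not check the key, but the SDK needs a non-empty value.
        return {"base_url": LOCAL_BASE_URL, "api_key": "ollama"}
    # Hosted API: the SDK's default base URL applies, so only the key is passed.
    return {"api_key": env["OPENAI_API_KEY"]}
```

The client is then created with OpenAI(**client_settings()), and the rest of the application code stays the same for both backends, apart from the model name.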
Modelfile: Custom System Prompts and Defaults
A Modelfile lets you bake in a system prompt, adjust the temperature, or set a stop sequence. You then run ollama create to build a named model from it:
cat > Modelfile <<'EOF'
FROM llama3.1:8b
SYSTEM "You are a concise technical writer. Never use bullet points."
PARAMETER temperature 0.3
EOF
ollama create tech-writer -f Modelfile
ollama run tech-writer
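If you maintain several variants of a model, it can be convenient to generate the Modelfile text from code instead of writing each one by hand. The sketch below only uses the FROM, SYSTEM, and PARAMETER instructions shown above; render_modelfile is an illustrative helper name, and the triple-quoted SYSTEM form is used so that multi-line prompts survive.

```python
from typing import Mapping, Union

Number = Union[int, float]


def render_modelfile(base: str, system: str, parameters: Mapping[str, Number]) -> str:
    """Build Modelfile text with FROM, SYSTEM, and PARAMETER instructions."""
    if '"""' in system:
        # Triple quotes delimit the prompt, so they cannot appear inside it.
        raise ValueError("system prompt must not contain triple quotes")
    lines = [f"FROM {base}", f'SYSTEM """{system}"""']
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"
```

Write the returned string to a file named Modelfile and pass it to ollama create with -f, exactly as in the shell example.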
Docker Compose Example
For a containerised setup with NVIDIA GPU support (the host needs the NVIDIA Container Toolkit installed):
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  ollama_data:
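After the container starts, the API may take a moment to accept connections, so scripts that pull models or send prompts right away can fail. Here is a small sketch of a readiness check that polls the server's root URL, which answers with HTTP 200 once the server is up. The function names, the retry count, and the delay are illustrative choices, not Ollama defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_up(url: str = "http://localhost:11434/") -> bool:
    """Return True if the server answers the root URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or DNS failure: not ready yet.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = server_is_up,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Passing the probe in as a parameter keeps the retry loop testable without a running server.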
Memory Requirements by Model
| Model | Quantization | VRAM / RAM |
|---|---|---|
| Llama 3.1 8B | Q4_K_M | ~5 GB |
| Llama 3.1 70B | Q4_K_M | ~40 GB |
| Mistral 7B | Q4_K_M | ~4.5 GB |
| Gemma 2 27B | Q4_K_M | ~16 GB |
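CPU-only inference works but is typically 3–10x slower than running on a GPU. The figures above are approximate, and longer context windows need additional memory. For interactive use, aim for at least a 16 GB M-series Mac or an NVIDIA GPU with enough VRAM to hold the model you plan to run.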
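For models that are not in the table, a rough estimate is the parameter count times the bits stored per weight, divided by eight. The sketch below assumes about 4.8 bits per weight for Q4_K_M, which is an approximation chosen for this example and not an official figure, and it covers the weights only, not the context cache or runtime overhead.

```python
def estimate_weight_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.8 is an assumed average for Q4_K_M, not an exact value.
    """
    return params_billion * bits_per_weight / 8
```

Under that assumption an 8B model comes out at about 4.8 GB and a 70B model at about 42 GB, which is in line with the table.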
Ollama is one of the quickest ways to get a local LLM running today. Pair it with Open WebUI for a ChatGPT-style interface, or point any OpenAI SDK at localhost:11434/v1 to drop it into an existing app.