Ollama is a free, open source runtime that lets you run large language models on your own Mac, Linux, or Windows machine. Installation takes one command, the first model downloads and runs in under 5 minutes, and after that, every query you send stays on your hardware with no API costs. The quality depends on your hardware and the model you choose, but a 7B parameter model on a standard MacBook Pro delivers responses good enough for most real-world coding and writing tasks.
This guide covers everything from installation through production use cases, including specific hardware recommendations and an honest comparison with paid cloud APIs.
What Ollama Is
Ollama is a local LLM runtime. It manages model downloads, serves a local HTTP API (including an endpoint compatible with the OpenAI API format), and handles the low-level details of running models efficiently on your hardware.
You install Ollama once. Then you pull any supported model with a single command. Models run locally on your CPU or GPU. No internet connection required after the initial download, no API keys, no rate limits, no per-query costs.
Ollama is not a model itself. It is the infrastructure that runs models. Think of it as Docker for LLMs: you pull an image (the model), run it, and interact with it through a standard interface.
Installation
Mac and Linux:
curl -fsSL https://ollama.com/install.sh | sh
That is the complete installation. Ollama installs as a background service that starts automatically.
Windows: Download the installer from ollama.com. It installs as a Windows service.
Verify installation:
ollama --version
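Because Ollama runs as a local HTTP server, a script can also confirm that the service is up. The sketch below assumes the default address used throughout this guide (http://localhost:11434) and uses only the Python standard library; the function name is illustrative.

```python
import json
import urllib.request


def ollama_version(base_url="http://localhost:11434", timeout=2.0):
    """Return the server's version string, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts (URLError subclasses it);
        # ValueError covers a response body that is not valid JSON.
        return None
```

A return value of None usually means the background service is not running yet.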
Running Your First Model
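ollama run llama3.1
On first run, this downloads the Llama 3.1 8B model (approximately 4.9GB). After the download, the model starts and you are in an interactive chat interface immediately. Note that the llama3.3 tag is published only at 70B, which is a download of roughly 40GB and needs far more memory than a typical laptop has.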
Other model families are pulled the same way:
ollama run mistral
ollama run phi3
ollama run gemma2
ollama run qwen2.5
ollama run deepseek-r1
Each command pulls the default size for that model family if you have not downloaded it yet.
Models Available in 2026
| Model | Download Size | RAM Required | Quality Notes |
|---|---|---|---|
| Llama 3.3 70B | ~40GB | 64GB RAM | Best open source quality; needs serious hardware |
| Llama 3.1 8B | ~4.9GB | 8GB RAM | Good balance; runs on any modern laptop |
| Mistral 7B | ~4.1GB | 8GB RAM | Excellent quality-to-size ratio, fast |
| Phi-3 Mini (3.8B) | ~2.3GB | 4GB RAM | Surprisingly capable for its size |
| Gemma 2 9B | ~5.4GB | 10GB RAM | Strong reasoning, good for analysis |
| Qwen 2.5 72B | ~43GB | 64GB RAM | Best open model for coding tasks |
| Qwen 2.5 7B | ~4.4GB | 8GB RAM | Good coding performance at 7B scale |
| Deepseek-R1 8B | ~4.9GB | 8GB RAM | Reasoning model, slower but more methodical |
| Deepseek-R1 70B | ~43GB | 64GB RAM | Best open reasoning model if you have hardware |
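The download sizes in the table follow from simple arithmetic: parameter count times bits per weight. The sketch below assumes roughly 4.5 bits per weight, a typical figure for 4-bit quantized builds. That number is an assumption for illustration, not something Ollama reports, and real files vary with the quantization format.

```python
def estimate_size_gb(params_billion, bits_per_weight=4.5):
    """Approximate on-disk size in GB for a quantized model.

    bits_per_weight=4.5 is an assumed average for 4-bit quantization
    (4-bit weights plus per-block scaling data).
    """
    return params_billion * bits_per_weight / 8


for name, params in [("7B", 7), ("9B", 9), ("70B", 70)]:
    print(f"{name}: ~{estimate_size_gb(params):.1f} GB")
# 7B: ~3.9 GB
# 9B: ~5.1 GB
# 70B: ~39.4 GB
```

These estimates land close to the table's figures of ~4.1GB, ~5.4GB, and ~40GB. The RAM requirement is higher than the file size because the model's working context and the operating system need room on top of the weights.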
Hardware Requirements: What Runs on What
The most common question is whether your hardware can run a useful model. The short answer is yes: almost any modern laptop with at least 8GB of RAM can run a 7B parameter model. The longer answer:
Apple Silicon MacBook (M1/M2/M3, 8GB RAM): Runs 7B models comfortably. Phi-3 Mini and Mistral 7B respond in 3-8 seconds per query. For simple tasks, this is fast enough to be usable. The unified memory architecture on Apple Silicon makes 8GB go further than 8GB on Intel/AMD.
Apple Silicon MacBook (M1/M2/M3, 16GB RAM): Runs 13B models without problem. Runs 7B models fast (1-3 seconds per query). This is the sweet spot for local LLM use on a laptop.
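Apple Silicon MacBook (M1 Max/M2 Max/M3 Max, 32-64GB RAM): Runs 70B models on the 64GB configuration. A 70B model is roughly 40GB of weights (see the table above), so 32GB machines top out at smaller models. Llama 3.3 70B on an M2 Max with 64GB takes 8-15 seconds per query for a paragraph-length response. Slower than cloud APIs but entirely viable for non-time-sensitive tasks.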
Linux/Windows with NVIDIA GPU (8GB VRAM): Runs 7B models very fast (under 2 seconds per query). Ollama uses CUDA automatically when a compatible NVIDIA GPU is detected.
Linux/Windows with NVIDIA GPU (24GB+ VRAM, e.g. RTX 3090/4090): Runs 13B-34B models at good speed. A 34B model on an RTX 4090 responds in 4-6 seconds.
Linux server with 2x A100 (80GB VRAM total): Runs 70B models at near-API speeds (2-4 seconds per query). This setup is comparable to hosted API quality and speed.
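As a quick way to apply the RAM column of the model table to your own machine, here is a minimal sketch. It copies a few rows from that table and treats the stated RAM requirement as a hard threshold, which is a simplification: speed also depends on the GPU and on what else is running.

```python
# (name, RAM required in GB) copied from the model table in this guide.
MODELS = [
    ("Phi-3 Mini (3.8B)", 4),
    ("Mistral 7B", 8),
    ("Gemma 2 9B", 10),
    ("Llama 3.3 70B", 64),
]


def models_that_fit(ram_gb):
    """Return the names of models whose stated RAM requirement is <= ram_gb."""
    return [name for name, required in MODELS if required <= ram_gb]


print(models_that_fit(16))
# ['Phi-3 Mini (3.8B)', 'Mistral 7B', 'Gemma 2 9B']
```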
How to Run and Interact with Models
Interactive chat:
ollama run llama3.3
Type your messages and press Enter. /bye exits.
One-shot query via API:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function that validates email addresses",
  "stream": false
}'
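The same request can be made from Python with only the standard library. This is a sketch rather than an official client: the function names are illustrative, and it assumes the default local address. It also shows how a streamed reply is put back together, since with "stream": true the server sends one JSON object per line instead of a single response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, stream=False):
    """Build the POST request for /api/generate."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt):
    """Send a non-streaming request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.load(resp)["response"]


def join_stream(lines):
    """Concatenate the "response" pieces from a streamed reply.

    Each line is one JSON object; the last one has "done" set to true.
    """
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

With the server running, `generate("mistral", "Write a haiku about local models")` returns the model's text as a plain string.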
OpenAI-compatible API (useful for integrations):
Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1. Any tool or library that supports OpenAI's API can point to this endpoint and use your local models instead.
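from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
# The api_key value is a placeholder; the client requires one to be set.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Explain async/await in JavaScript"}],
)
print(response.choices[0].message.content)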
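To see what "OpenAI-compatible" means on the wire, the sketch below sends the same chat request without the openai package. The helper names are hypothetical, and the Authorization value simply mirrors the placeholder key used in the client example.

```python
import json
import urllib.request


def chat_request(messages, model="mistral", base_url="http://localhost:11434/v1"):
    """Build a POST request in the OpenAI chat-completions format."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            # Placeholder value, mirroring api_key="ollama" in the client example.
            "Authorization": "Bearer ollama",
        },
    )


def first_reply(completion):
    """Extract the assistant text from a chat-completions response dict."""
    return completion["choices"][0]["message"]["content"]


def chat(messages, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(chat_request(messages, **kwargs)) as resp:
        return first_reply(json.load(resp))
```

Because the request and response shapes match OpenAI's, switching an existing integration to a local model is mostly a matter of changing the base URL and model name.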
List downloaded models:
ollama list
Remove a model:
ollama rm llama3.3
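The list of downloaded models is also available over the local API, which is handy for scripts that need to check what is installed. A minimal sketch, assuming the default address and using illustrative function names:

```python
import json
import urllib.request


def summarize_models(tags_response):
    """Turn an /api/tags response dict into (name, size in GB) pairs."""
    return [
        (m["name"], round(m["size"] / 1e9, 1))
        for m in tags_response.get("models", [])
    ]


def list_models(base_url="http://localhost:11434"):
    """Fetch the installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return summarize_models(json.load(resp))
```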
Use Cases Where Ollama Beats Paid APIs
Code review with sensitive code: Many companies have policies prohibiting source code from being sent to external services. With Ollama, code review stays entirely on your infrastructure. A local Mistral 7B is good enough for most code review tasks (catching obvious bugs, suggesting improvements, explaining logic).
Local knowledge base / RAG: Building a retrieval-augmented generation system over internal documentation? Running the LLM locally means no query data leaves your network. See the companion guide on building an open source RAG stack.
High-volume prototyping: When you are iterating on prompts or testing AI features in development, API costs add up fast. Running 500 test queries against a local model costs nothing. Running them against GPT-4o costs $10-50 depending on length.
Offline work: On a plane, in a location with unreliable internet, or in an air-gapped environment. Local models work without network access once downloaded.
Learning and experimentation: Want to understand how different models respond to the same prompt? Ollama makes it trivial to run the same query against 5 different models in succession, which would cost $2-5 per model via API.
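The side-by-side comparison described above can be sketched in a few lines. The function names here are illustrative; `ask_ollama` assumes the default local address and that each model in the list has already been pulled.

```python
import json
import urllib.request


def ask_ollama(model, prompt, base_url="http://localhost:11434"):
    """Send one non-streaming prompt to a local model and return its text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]


def compare_models(prompt, models, ask=ask_ollama):
    """Run one prompt against several models and collect the answers."""
    results = {}
    for model in models:
        try:
            results[model] = ask(model, prompt)
        except Exception as exc:  # keep going if one model is missing or fails
            results[model] = f"<error: {exc}>"
    return results
```

For example, `compare_models("Explain recursion in one paragraph", ["mistral", "phi3", "gemma2"])` returns a dict keyed by model name. Passing a different `ask` callable makes the same loop work against any other backend.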
Integration with Open WebUI
Open WebUI provides a ChatGPT-like browser interface that connects to your local Ollama instance. It is the quickest way to give non-technical team members access to a local LLM.
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 and you have a full chat interface that routes to your local Ollama models.
Integration with Continue.dev for Coding
Continue.dev (the open source VS Code coding assistant) can use Ollama as its backend. This gives you free AI autocomplete in your editor.
In Continue.dev's config.json:
{
  "models": [{
    "title": "Mistral 7B (Local)",
    "provider": "ollama",
    "model": "mistral"
  }],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 7B",
    "provider": "ollama",
    "model": "qwen2.5:7b"
  }
}
Completions arrive in 300-600ms on a MacBook with 16GB RAM. Slower than Cursor's cloud-backed completions, but free and private.
Performance vs Cloud APIs
Local 7B model (Mistral, laptop with 16GB RAM) vs GPT-4o-mini (cloud):
| Metric | Local Mistral 7B | GPT-4o-mini |
|---|---|---|
| Latency (first token) | 800-1500ms | 400-800ms |
| Throughput | 15-30 tokens/sec | 60-120 tokens/sec |
| Quality on simple tasks | Good | Excellent |
| Quality on complex reasoning | Moderate | Very Good |
| Cost per query | $0 | ~$0.001-0.005 |
For tasks like summarizing text, explaining code, generating boilerplate, and answering questions about documentation, local 7B models are close enough to GPT-4o-mini that the quality difference does not matter for most use cases. For tasks requiring deep reasoning, multi-step problem solving, or nuanced writing, cloud models remain ahead.
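To turn the latency and throughput rows into a felt difference, total response time is roughly the time to first token plus the output length divided by throughput. The sketch below plugs in the best-case figures from the table; the 300-token answer length is an assumption chosen for illustration.

```python
def response_seconds(first_token_ms, output_tokens, tokens_per_sec):
    """Estimated wall-clock time: time to first token plus generation time."""
    return first_token_ms / 1000 + output_tokens / tokens_per_sec


# Best-case figures from the table, for a 300-token answer (an assumed length).
local = response_seconds(800, 300, 30)    # local Mistral 7B
cloud = response_seconds(400, 300, 120)   # GPT-4o-mini
print(f"local: {local:.1f}s, cloud: {cloud:.1f}s, ratio: {local / cloud:.1f}x")
# local: 10.8s, cloud: 2.9s, ratio: 3.7x
```

For short answers the gap is dominated by first-token latency; for long answers it approaches the throughput ratio.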
Limitations: Be Honest About What Ollama Cannot Do
No internet access by default. Ollama models run entirely offline. They cannot browse the web, read URLs, or access real-time information. For tasks requiring current information, you need a cloud API or a RAG system with up-to-date documents.
Hardware ceiling on quality. The best open source models require 64GB+ RAM to run. Most developers are using 8-16GB laptops, which limits them to 7B-13B models. These models are genuinely good, but they are not GPT-4o. The quality gap is real, especially for complex reasoning and instruction-following.
Slower inference. A 7B model on a laptop is roughly 2-8x slower than a cloud API in response time. For interactive use, this is noticeable.
No fine-tuning in Ollama itself. Ollama runs pre-trained models. If you need fine-tuning on your own data, you need separate tooling (Axolotl, Unsloth, etc.) and then convert the result to GGUF format for Ollama.
Context windows vary. The context window size (how much text the model can process at once) depends on the model. Mistral 7B supports up to 32k tokens. Llama 3.3 supports up to 128k tokens. Ollama's default context length is usually well below those maximums, and raising it toward the maximum on a laptop costs both memory and speed.
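On the context window point, Ollama's API accepts a per-request options object, and its num_ctx field sets the context length in tokens. The sketch below only builds the request body; the helper name is illustrative, and 8192 is an arbitrary example value rather than a recommendation.

```python
import json


def generate_payload(model, prompt, num_ctx=None):
    """Build an /api/generate body, optionally requesting a larger context."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        # "options" carries per-request runtime parameters; num_ctx is the
        # context length in tokens.
        payload["options"] = {"num_ctx": num_ctx}
    return payload


print(json.dumps(generate_payload("mistral", "Summarize this file", num_ctx=8192), indent=2))
```

A larger num_ctx increases memory use, so on an 8-16GB laptop it is worth raising it only as far as the task needs.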
Keep Reading
- Best Local LLM in 2026 — Which specific model to run given your hardware
- Building a RAG System With Open Source Tools — Use Ollama as the LLM backend in a local knowledge base
- Open Source Alternatives to GitHub Copilot — Connect Ollama to Continue.dev for free AI coding