Ollama Complete Guide 2026: Run Any LLM Locally in 5 Minutes
Ollama lets you run Llama 3.3, Mistral, Phi-3, and DeepSeek-R1 on your own hardware for free. Complete setup guide, hardware requirements, and real use cases.
Ollama is a free, open source runtime that lets you run large language models on your own Mac, Linux, or Windows machine. Installation takes one command, the first model downloads and runs in under 5 minutes, and after that, every query you send stays on your hardware with no API costs. The quality depends on your hardware and the model you choose, but a 7B parameter model on a standard MacBook Pro delivers responses good enough for most real-world coding and writing tasks.
This guide covers everything from installation through production use cases, including specific hardware recommendations and an honest comparison with paid cloud APIs.
What Is Ollama?
Ollama is a local LLM runtime. It manages model downloads, serves a local HTTP API (including an endpoint compatible with the OpenAI API format), and handles the low-level details of running models efficiently on your hardware.
You install Ollama once. Then you pull any supported model with a single command. Models run locally on your CPU or GPU. No internet connection required after the initial download, no API keys, no rate limits, no per-query costs.
Ollama is not a model itself. It is the infrastructure that runs models. Think of it as Docker for LLMs: you pull an image (the model), run it, and interact with it through a standard interface.
Installation
Mac and Linux:
curl -fsSL https://ollama.com/install.sh | sh
That is the complete installation. Ollama installs as a background service that starts automatically.
Windows: Download the installer from ollama.com. It installs as a Windows service.
Verify installation:
ollama --version
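Beyond the CLI check, you can confirm that the background service is answering requests. The following is a minimal sketch, assuming the default address http://localhost:11434 used in the API examples in this guide and the server's /api/version endpoint; the function names are illustrative only.

```python
import json
import urllib.error
import urllib.request
from typing import Optional

# Default local address, as used by the API examples in this guide.
OLLAMA_URL = "http://localhost:11434"


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the server's version, or None if nothing answers at base_url."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError.
        return None
```

Calling server_version() returns a version string when the service is up and None when it is not, which makes it easy to use as a startup check in scripts.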
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren.
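Start a model with ollama run followed by the model name. On first run, Ollama downloads the model: an 8B Llama model is approximately 4.7GB (that size corresponds to Llama 3.1 8B; Llama 3.3 itself is published only at 70B, roughly 43GB). After the download, the model starts and you are in an interactive chat interface immediately.
Other model families work the same way: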
ollama run mistral
ollama run phi3
ollama run gemma2
ollama run qwen2.5
ollama run deepseek-r1
Each command pulls the default size for that model family if you have not downloaded it yet.
Models Available in 2026
| Model | Size | RAM Required | Quality Notes |
| --- | --- | --- | --- |
| Llama 3.3 70B | ~40GB | 64GB RAM | Best open source quality; needs serious hardware |
| Llama 3.1 8B | ~4.7GB | 8GB RAM | Good balance; runs on any modern laptop |
| Mistral 7B | ~4.1GB | 8GB RAM | Excellent quality-to-size ratio, fast |
| Phi-3 Mini (3.8B) | ~2.3GB | 4GB RAM | Surprisingly capable for its size |
| Gemma 2 9B | ~5.4GB | 10GB RAM | Strong reasoning, good for analysis |
| Qwen 2.5 72B | ~43GB | 64GB RAM | Best open model for coding tasks |
| Qwen 2.5 7B | ~4.4GB | 8GB RAM | Good coding performance at 7B scale |
| DeepSeek-R1 8B | ~4.9GB | 8GB RAM | Reasoning model, slower but more methodical |
| DeepSeek-R1 70B | ~43GB | 64GB RAM | Best open reasoning model if you have hardware |
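The RAM column can be turned into a quick lookup. The sketch below copies a few rows from the table above and filters them by available memory; the function name is hypothetical, and the numbers are the guide's approximate figures rather than measured limits.

```python
# (model name, approximate RAM required in GB), copied from a few rows of the table above.
MODELS = [
    ("Phi-3 Mini (3.8B)", 4),
    ("Mistral 7B", 8),
    ("Qwen 2.5 7B", 8),
    ("Gemma 2 9B", 10),
    ("Qwen 2.5 72B", 64),
]


def models_that_fit(ram_gb: float) -> list[str]:
    """Return the models whose listed RAM requirement is at most ram_gb."""
    return [name for name, required in MODELS if required <= ram_gb]


print(models_that_fit(16))
# ['Phi-3 Mini (3.8B)', 'Mistral 7B', 'Qwen 2.5 7B', 'Gemma 2 9B']
```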
Hardware Requirements: What Runs on What
The most common question is whether your hardware can run a useful model. The short answer is yes, almost any laptop can run a 7B parameter model. The longer answer:
Apple Silicon MacBook (M1/M2/M3, 8GB RAM):
Runs 7B models comfortably. Phi-3 Mini and Mistral 7B respond in 3-8 seconds per query. For simple tasks, this is fast enough to be usable. The unified memory architecture on Apple Silicon makes 8GB go further than 8GB on Intel/AMD.
Apple Silicon MacBook (M1/M2/M3, 16GB RAM):
Runs 13B models without problem. Runs 7B models fast (1-3 seconds per query). This is the sweet spot for local LLM use on a laptop.
Apple Silicon MacBook (M1 Max/M2 Max/M3 Max, 32-64GB RAM):
Runs 70B models. Llama 3.3 70B on an M2 Max with 64GB takes 8-15 seconds per query for a paragraph-length response. Slower than cloud APIs but entirely viable for non-time-sensitive tasks.
Linux/Windows with NVIDIA GPU (8GB VRAM):
Runs 7B models very fast (under 2 seconds per query). Ollama uses CUDA automatically when a compatible NVIDIA GPU is detected.
Linux/Windows with NVIDIA GPU (24GB+ VRAM, e.g. RTX 3090/4090):
Runs 13B-34B models at good speed. A 34B model on an RTX 4090 responds in 4-6 seconds.
Linux server with 2x A100 (80GB VRAM total):
Runs 70B models at near-API speeds (2-4 seconds per query). This setup is comparable to hosted API quality and speed.
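A rough rule of thumb helps explain why these tiers fall where they do: a model's file size is approximately its parameter count times the bits stored per weight. The sketch below assumes about 4.5 bits per weight for a 4-bit quantized model, which is an illustrative figure and not something Ollama reports; real memory use is higher once the context cache and the operating system are counted.

```python
def estimated_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate model size: parameters x bits per weight, converted to gigabytes.

    The 4.5 bits-per-weight default is an assumed average for 4-bit quantization.
    """
    return params_billion * bits_per_weight / 8


for size in (7, 13, 70):
    print(f"{size}B parameters -> about {estimated_size_gb(size):.1f} GB")
```

Under that assumption a 7B model comes out near 4GB and a 70B model near 40GB, in line with the download sizes listed earlier in this guide.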
How to Run and Interact with Models
Interactive chat:
ollama run llama3.3
Type your messages and press Enter. /bye exits.
One-shot query via API:
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a Python function that validates email addresses",
  "stream": false
}'
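The request above sets "stream" to false, so the answer arrives as one JSON object. With streaming enabled, the endpoint instead sends newline-delimited JSON, one fragment per line, ending with an object whose "done" field is true. The following sketch shows one way to reassemble such a reply; the helper name and the sample lines are made up for illustration.

```python
import json
from typing import Iterable


def join_stream(lines: Iterable[bytes]) -> str:
    """Concatenate the "response" fragments from a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of the reply
    return "".join(parts)


# Offline demonstration with hand-written lines shaped like a streamed reply.
sample = [
    b'{"response": "Hello", "done": false}\n',
    b'{"response": " world", "done": false}\n',
    b'{"response": "", "done": true}\n',
]
print(join_stream(sample))  # Hello world
```

Against a live server, the same function can be given the response object returned by urllib.request.urlopen, which yields the body line by line when iterated.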
OpenAI-compatible API (useful for integrations):
Ollama serves an OpenAI-compatible endpoint at http://localhost:11434/v1. Any tool or library that supports OpenAI's API can point to this endpoint and use your local models instead.
from openai import OpenAI

# The client library requires an api_key value, but Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="mistral",
    messages=[{"role": "user", "content": "Explain async/await in JavaScript"}],
)
print(response.choices[0].message.content)
List downloaded models:
ollama list
Remove a model:
ollama rm llama3.3
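The same model inventory that ollama list prints is also available over the local API at /api/tags, which is convenient for scripts. The sketch below only parses a response of that shape; the sample payload is hand-written for illustration, and the helper name is hypothetical.

```python
def summarize_models(payload: dict) -> list[tuple[str, float]]:
    """Return (name, size in GB) pairs from an /api/tags response."""
    return [
        (model["name"], round(model["size"] / 1e9, 1))
        for model in payload.get("models", [])
    ]


# Illustrative payload; a real response carries more fields per model.
sample = {"models": [{"name": "mistral:latest", "size": 4113301824}]}
print(summarize_models(sample))  # [('mistral:latest', 4.1)]
```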
Use Cases Where Ollama Beats Paid APIs
Code review with sensitive code: Many companies have policies prohibiting source code from being sent to external services. With Ollama, code review stays entirely on your infrastructure. A local Mistral 7B is good enough for most code review tasks (catching obvious bugs, suggesting improvements, explaining logic).
Local knowledge base / RAG: Building a retrieval-augmented generation system over internal documentation? Running the LLM locally means no query data leaves your network. See the companion guide on building an open source RAG stack.
High-volume prototyping: When you are iterating on prompts or testing AI features in development, API costs add up fast. Running 500 test queries against a local model costs nothing. Running them against GPT-4o costs $10-50 depending on length.
Offline work: On a plane, in a location with unreliable internet, or in an air-gapped environment. Local models work without network access once downloaded.
Learning and experimentation: Want to understand how different models respond to the same prompt? Ollama makes it trivial to run the same query against 5 different models in succession, which would cost $2-5 per model via API.
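The side-by-side comparison described above can be sketched as a small loop. To keep the example runnable without a server, it takes the model call as a parameter; compare_models and fake_ask are illustrative names, and in practice you would pass a function that calls your local Ollama endpoint.

```python
from typing import Callable


def compare_models(
    prompt: str, models: list[str], ask: Callable[[str, str], str]
) -> dict[str, str]:
    """Send the same prompt to each model and collect the answers by model name."""
    return {model: ask(model, prompt) for model in models}


def fake_ask(model: str, prompt: str) -> str:
    """Stand-in for a real client call so the sketch runs without a server."""
    return f"[{model}] answer to: {prompt}"


results = compare_models("Explain recursion", ["mistral", "phi3", "gemma2"], fake_ask)
for model, answer in results.items():
    print(model, "->", answer)
```

Because the local models cost nothing per query, the list of model names can be as long as your disk space allows.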
Integration with Open WebUI
Open WebUI provides a ChatGPT-like browser interface that connects to your local Ollama instance. It is the quickest way to give non-technical team members access to a local LLM.
For code completion in an editor (for example through Continue.dev), completions arrive in 300-600ms on a MacBook with 16GB RAM. That is slower than Cursor's cloud-backed completions, but free and private.
Performance vs Cloud APIs
Local 7B model (Mistral, laptop with 16GB RAM) vs GPT-4o-mini (cloud):
| Metric | Local Mistral 7B | GPT-4o-mini |
| --- | --- | --- |
| Latency (first token) | 800-1500ms | 400-800ms |
| Throughput | 15-30 tokens/sec | 60-120 tokens/sec |
| Quality on simple tasks | Good | Excellent |
| Quality on complex reasoning | Moderate | Very Good |
| Cost per query | $0 | ~$0.001-0.005 |
For tasks like summarizing text, explaining code, generating boilerplate, and answering questions about documentation, local 7B models are close enough to GPT-4o-mini that the quality difference does not matter for most use cases. For tasks requiring deep reasoning, multi-step problem solving, or nuanced writing, cloud models remain ahead.
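To see what the latency and throughput figures above mean for a whole response, add the time to first token to the time spent generating the remaining tokens. The sketch below uses the ranges from the comparison and assumes a 300-token reply, which is an arbitrary length chosen for illustration.

```python
def response_seconds(first_token_ms: float, tokens: int, tokens_per_sec: float) -> float:
    """Total time for a reply: wait for the first token, then generate the rest."""
    return first_token_ms / 1000 + tokens / tokens_per_sec


TOKENS = 300  # assumed reply length, not a figure from the comparison

local_best = response_seconds(800, TOKENS, 30)
local_worst = response_seconds(1500, TOKENS, 15)
cloud_best = response_seconds(400, TOKENS, 120)
cloud_worst = response_seconds(800, TOKENS, 60)

print(f"local: {local_best:.1f}-{local_worst:.1f}s, cloud: {cloud_best:.1f}-{cloud_worst:.1f}s")
# local: 10.8-21.5s, cloud: 2.9-5.8s
```

For longer replies the throughput term dominates, so the gap in tokens per second matters more than the gap in first-token latency.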
Limitations: Be Honest About What Ollama Cannot Do
No internet access by default. Ollama models run entirely offline. They cannot browse the web, read URLs, or access real-time information. For tasks requiring current information, you need a cloud API or a RAG system with up-to-date documents.
Hardware ceiling on quality. The best open source models require 64GB+ RAM to run. Most developers are using 8-16GB laptops, which limits them to 7B-13B models. These models are genuinely good, but they are not GPT-4o. The quality gap is real, especially for complex reasoning and instruction-following.
Slower inference. A 7B model on a laptop is roughly 2-8x slower than a cloud API in end-to-end response time. For interactive use, this is noticeable.
No fine-tuning in Ollama itself. Ollama runs pre-trained models. If you need fine-tuning on your own data, you need separate tooling (Axolotl, Unsloth, etc.) and then convert the result to GGUF format for Ollama.
Context windows vary. The context window size (how much text the model can process at once) varies by model and the quantization level used. Mistral 7B supports up to 32k tokens. Llama 3.3 supports up to 128k tokens. But running at maximum context window size on a laptop is slow.
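The context size used for a request can be set per call through the "options" field of the native API, using the num_ctx option. The sketch below only builds the request body; the value 8192 is an arbitrary example, and the helper name is hypothetical. Larger values increase memory use and slow down responses, as noted above.

```python
import json


def generate_payload(model: str, prompt: str, num_ctx: int) -> bytes:
    """Build a non-streaming /api/generate body that requests a specific context size."""
    body = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window, in tokens
    }
    return json.dumps(body).encode("utf-8")


payload = generate_payload("mistral", "Summarize this document: ...", num_ctx=8192)
print(json.loads(payload)["options"])  # {'num_ctx': 8192}
```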
Frequently Asked Questions
What is Ollama?
Ollama is a free, open source runtime that lets you run large language models (LLMs) locally on your own hardware. It manages model downloads, provides a local API compatible with OpenAI's format, and handles efficient inference on CPU or GPU. You install it once, pull a model with a single command, and run queries without internet, API keys, or per-query costs.
How does Ollama work?
Ollama works by downloading model files (typically in GGUF format) and running them locally using optimized inference engines. It provides a command-line interface for interactive chat and a REST API for programmatic access. The API is compatible with the OpenAI API format, so you can point any OpenAI-compatible client to your local Ollama instance. Models run entirely on your hardware, and no data leaves your machine.
What are the best practices for using Ollama?
Best practices include:
1) Choose a model that fits your hardware: 7B models for 8GB RAM, 13B for 16GB, 70B for 64GB+.
2) Use quantized models (e.g., Q4_K_M) to reduce memory usage with minimal quality loss.
3) For production, run Ollama as a systemd service and use the API with retry logic.
4) For sensitive data, always use local models to keep data private.
5) Combine with Open WebUI for a ChatGPT-like interface or Continue.dev for AI coding.
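The retry logic mentioned for production use can be sketched as a small wrapper with exponential backoff. This is one possible approach, not a feature of Ollama: with_retries is a hypothetical helper, the delays are arbitrary, and it retries only on OSError (which covers refused connections and timeouts).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on OSError with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return call()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Wrap any function that talks to the local API, for example with_retries(lambda: send_request(payload)), where send_request stands for your own client code. The sleep parameter exists so the delays can be replaced in tests.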
How much does Ollama cost?
Ollama itself is completely free and open source. There are no subscription fees, API costs, or usage limits. The only costs are the hardware you already own (your laptop or server) and the electricity to run it; there are no per-query charges. This makes it ideal for high-volume prototyping, offline use, and privacy-sensitive applications.
Is Ollama worth it in 2026?
Yes, Ollama is worth it in 2026 for many use cases. Local models like Llama 3.1 8B and Mistral 7B are now good enough for coding assistance, summarization, and content generation. For privacy, offline work, or high-volume prototyping, Ollama beats cloud APIs. However, for complex reasoning or tasks requiring the latest information, cloud models like GPT-4o still outperform local models. The choice depends on how you weigh quality against privacy and cost.