Running a language model locally means your data never leaves your machine. No API calls, no cloud storage, no provider data policies to worry about. For source code you cannot send to external services, client data under confidentiality agreements, or any content where network transmission is a risk, local LLMs are the right architectural choice. The quality is lower than the best cloud models, but it is better than many teams expect.
When Local Makes Sense
The case for local models is simple: data sovereignty. When you run a model locally, the model weights live on your disk, inference runs on your CPU or GPU, and inputs and outputs never leave the machine. This satisfies requirements that no cloud API can match.
Concrete use cases where local is the right choice:
Source code under IP agreements. Many software development contracts include clauses where the client owns the code and the developer agrees not to share it with third parties. Sending that code to GPT-4o or Claude technically violates those agreements. Running Llama locally does not.
Client PII you are processing. Medical records, financial data, legal documents, and personal information often cannot be sent to external processors without explicit consent or a data processing agreement. Processing locally sidesteps the compliance question entirely.
Confidential strategy documents. M&A discussions, financial projections, product roadmaps that should not be on any external server.
Regulated industries. Healthcare organizations with PHI, financial services firms with certain client data, government contractors - all have data handling requirements that external APIs may not satisfy.
Local models are not necessary for most tasks. If you are writing a blog post, summarizing a public article, or asking a general coding question, there is no meaningful privacy benefit to running locally. The performance cost is real and should be justified by a genuine privacy requirement.
Setup: Ollama
Ollama is the simplest way to run open models locally. It handles model downloading, quantization, and serving a local API endpoint that is compatible with the OpenAI API format.
Installation:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
Pulling a model:
ollama pull llama3.3 # 40GB - high quality, needs 64GB RAM
ollama pull mistral # 4GB - good quality, needs 8GB RAM
ollama pull qwen2.5-coder # 4GB - optimized for code
ollama pull llama3.2 # 2GB - fast, lower quality
Running a model:
ollama run mistral
Ollama automatically serves a local API on http://localhost:11434. It accepts requests in OpenAI API format, so any client that supports OpenAI's API can be pointed at Ollama with a base URL change.
IDE Integration: Continue.dev
Continue.dev is an open-source AI coding assistant for VS Code and JetBrains IDEs. It integrates with Ollama, OpenAI, Anthropic, and other providers. For local LLM coding, Continue.dev is the most practical setup.
Installation: Install the Continue extension from the VS Code marketplace or JetBrains plugin marketplace.
Configuration for Ollama (in ~/.continue/config.json):
{
"models": [
{
"title": "Mistral Local",
"provider": "ollama",
"model": "mistral",
"apiBase": "http://localhost:11434"
},
{
"title": "Qwen Coder Local",
"provider": "ollama",
"model": "qwen2.5-coder",
"apiBase": "http://localhost:11434"
}
],
"tabAutocompleteModel": {
"title": "Qwen Coder",
"provider": "ollama",
"model": "qwen2.5-coder"
}
}
With this configuration, Continue.dev uses your local Ollama instance for both chat (Cmd+L) and tab completion. No requests go to external APIs.
Performance: What to Expect on Real Hardware
Generation speed in tokens per second (tok/s) is the practical measure of local model performance. Slower than 10 tok/s feels noticeably slow for interactive use. 30 tok/s feels responsive. 60+ tok/s is fast.
Apple Silicon (M-series) Macs - approximate benchmarks:
| Model | Hardware | Speed |
|---|---|---|
| Mistral 7B (4-bit) | M2 MacBook Air (16GB) | 30-45 tok/s |
| Llama 3.2 8B (4-bit) | M2 MacBook Air (16GB) | 28-40 tok/s |
| Llama 3.3 70B (4-bit) | M2 Max (96GB) | 15-25 tok/s |
| Mistral 7B (4-bit) | M3 Pro (36GB) | 40-60 tok/s |
NVIDIA GPU (discrete):
| Model | Hardware | Speed |
|---|---|---|
| Mistral 7B (4-bit) | RTX 3090 (24GB) | 70-100 tok/s |
| Llama 3.3 70B (4-bit) | 2x A100 (80GB each) | 30-50 tok/s |
Apple Silicon is particularly well-suited to local LLMs because the unified memory architecture allows the GPU and CPU to share RAM. A 64GB M2 Max can run Llama 3.3 70B without moving data between VRAM and system RAM - a bottleneck that constrains discrete GPU setups.
Quality Trade-Off: Honest Assessment
The quality gap between local models and frontier cloud models is real. Here is an honest assessment by task type.
Code completion (short completions, <50 lines): Qwen 2.5 Coder 7B and Mistral 7B are genuinely good. For tab completion of straightforward code, the quality difference from GPT-4o is small enough that most developers will not notice it in daily use.
Code generation (implementing functions from description): Mistral 7B is noticeably weaker than GPT-4o for complex implementations. It produces more hallucinated APIs, more logical errors, and struggles more with nuanced requirements. Llama 3.3 70B is significantly closer to GPT-4o quality and is the recommended choice for complex coding tasks if your hardware supports it.
Document summarization: 7B models are adequate for summarizing well-structured documents. They struggle more with ambiguous source material and produce shorter, less nuanced summaries than larger models.
Complex reasoning: 7B models are substantially weaker at multi-step reasoning. For analytical tasks, Llama 3.3 70B is much better, and GPT-4o is still stronger.
Instruction following: Smaller local models are more likely to ignore formatting instructions, miss constraints, and drift from complex prompts. This improves significantly at 70B scale.
The Cost Argument
Once hardware is purchased, local inference has no per-token cost. For teams with high query volumes, the math can favor local models even before privacy considerations:
At GPT-4o pricing ($2.50 input / $10 output per 1M tokens), a team generating 50 million output tokens per month spends $500/month. An M2 Max MacBook costs approximately $3,500 and runs Llama 3.3 70B at usable speed. The hardware pays for itself in 7 months - and the quality of Llama 3.3 70B approaches GPT-4o for many tasks.
For smaller query volumes the economics are less clear. A team generating 5 million output tokens per month spends $50 at GPT-4o pricing. Local hardware does not pay for itself on $50/month in API costs, but the privacy benefit may still justify it.
Recommended Setup by Use Case
Developer needing private code assistance, 16GB MacBook: Ollama + Mistral 7B + Continue.dev. Adequate for tab completion and simple function generation. Step up to a machine with more RAM for complex tasks.
Developer needing private code assistance, 32GB+ Mac: Ollama + Llama 3.2 or Qwen 2.5 Coder + Continue.dev. Good quality for most coding tasks.
Team needing private document processing: Dedicated server with Llama 3.3 70B, Ollama API exposed internally, team clients pointed at internal endpoint.
Security-critical environment: Air-gapped setup with models loaded from local storage, no internet connection required after model download.
Keep Reading
- LLM Privacy for Enterprise - policies and cloud alternatives alongside local options
- Best LLM for Coding 2026 - how local models compare to cloud models for coding
- LLM Context Window Comparison - context window sizes for local vs cloud models
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace - chat, projects, time tracking, AI meeting summaries, and invoicing - in one tool. Try it free.