What Is Llamafile?
Llamafile is a Mozilla project that solves the distribution problem for local LLMs. The challenge: running open-source models typically means installing Python, CUDA drivers, and other dependencies, then downloading model weights separately. That setup can take an hour and has many points of failure.
Llamafile packages the model weights, the inference runtime, and a web UI into a single executable file. On Linux and macOS you chmod +x the file and run it from a terminal. On Windows you rename it so that it ends in .exe and run it. No installation is required, no Python environment, and no GPU drivers for CPU inference (though it can use the GPU if one is available).
The underlying technology is Cosmopolitan Libc, created by Justine Tunney. It builds an "Actually Portable Executable": a polyglot binary that contains native code for multiple architectures (x86_64, ARM64), runs on multiple operating systems, and selects the right code path at launch.
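To make the launch-time selection concrete, here is a rough illustration of the decision a multi-platform binary has to make: which operating system and CPU architecture it is running on. This is not llamafile's actual loader; the function name detect_platform is made up for this sketch.

```shell
#!/bin/sh
# Illustration only: report the OS/architecture pair that a
# multi-platform binary would have to choose between at launch.
detect_platform() {
    os=$(uname -s)       # e.g. Linux, Darwin
    arch=$(uname -m)     # e.g. x86_64, arm64, aarch64
    # Normalise the common spellings of the two architectures named above.
    case "$arch" in
        x86_64|amd64)   arch=x86_64 ;;
        arm64|aarch64)  arch=ARM64 ;;
    esac
    printf '%s/%s\n' "$os" "$arch"
}

detect_platform
```

A llamafile makes this kind of choice by itself, which is why the same file can be handed to users on different machines.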
Download and Run
# Download a llamafile (example: Llama 3.2 3B)
wget https://huggingface.co/Mozilla/Llama-3.2-3B-Instruct/resolve/main/Llama-3.2-3B-Instruct.Q6_K.llamafile
# Make executable (Linux/Mac)
chmod +x Llama-3.2-3B-Instruct.Q6_K.llamafile
# Run — opens browser UI automatically
./Llama-3.2-3B-Instruct.Q6_K.llamafile
On Windows: rename to .exe and double-click.
CLI Mode and OpenAI-Compatible Server
Llamafile includes llama.cpp's server mode with an OpenAI-compatible API:
# Start as API server on port 8080
./model.llamafile --server --port 8080 --nobrowser
# Query via curl
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "gpt-3.5-turbo",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Any application that accepts an OpenAI base URL can point to your local llamafile server.
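As a minimal sketch of what "pointing at the local server" looks like from a script, the helpers below wrap the curl call shown earlier. The names LLAMAFILE_BASE_URL, json_escape, chat_payload, and ask are hypothetical, and the base URL assumes the server was started on port 8080 as above.

```shell
#!/bin/sh
# Assumed base URL: adjust it to match how you started the server.
LLAMAFILE_BASE_URL="${LLAMAFILE_BASE_URL:-http://localhost:8080/v1}"

# Escape backslashes and double quotes so the text can sit inside a JSON
# string. Newlines and control characters are not handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build a chat completion request body with a single user message.
# The model name mirrors the curl example above.
chat_payload() {
    printf '{"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "%s"}]}' \
        "$(json_escape "$1")"
}

# Send the prompt to the local server and print the raw JSON response.
ask() {
    curl -s "$LLAMAFILE_BASE_URL/chat/completions" \
        -H "Content-Type: application/json" \
        -d "$(chat_payload "$1")"
}
```

With the server running, ask "Hello!" prints the JSON response. Setting LLAMAFILE_BASE_URL to another host or port redirects every call without touching the functions.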
CLI Inference
# Single prompt, no server
./model.llamafile --cli -p "Explain the difference between TCP and UDP:" --temp 0.7 -n 200
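Because CLI mode takes a prompt and exits, it is easy to script. The sketch below runs a list of prompts, one per line on standard input, through the same command as above. LLAMAFILE and run_prompts are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Assumed path to your executable; override it by setting LLAMAFILE.
LLAMAFILE="${LLAMAFILE:-./model.llamafile}"

# Read prompts from stdin, one per line, and run each through CLI mode.
run_prompts() {
    while IFS= read -r prompt; do
        [ -n "$prompt" ] || continue      # skip blank lines
        printf '### %s\n' "$prompt"
        # Redirect stdin so the model process cannot consume the prompt list.
        "$LLAMAFILE" --cli -p "$prompt" --temp 0.7 -n 200 </dev/null
        printf '\n'
    done
}
```

For example, run_prompts < prompts.txt > answers.txt writes each prompt as a "###" header followed by the model's output.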
Package Your Own Model
You can bundle any GGUF model into a llamafile:
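# zipalign ships inside the llamafile release archive, so no separate install is needed
# Download the llamafile runtime (pinned to the 0.9.0 release)
wget https://github.com/Mozilla-Ocho/llamafile/releases/download/0.9.0/llamafile-0.9.0.zip
unzip llamafile-0.9.0.zip
# Bundle your GGUF model (the binaries are assumed to unpack under llamafile-0.9.0/bin/)
cp llamafile-0.9.0/bin/llamafile my-custom-model.llamafile
llamafile-0.9.0/bin/zipalign -j0 my-custom-model.llamafile your-model.gguf
# Optionally embed default arguments, including a system prompt.
# .args holds one argument per line, and -m names the embedded model.
printf '%s\n' -m your-model.gguf -p "You are a helpful cooking assistant." --temp 0.8 > .args
llamafile-0.9.0/bin/zipalign -j0 my-custom-model.llamafile .args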
The resulting file is a self-contained executable that runs your model with your system prompt, distributable as a single file.
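A llamafile is also a ZIP archive, so you can check that the bundling steps worked before you hand the file to anyone. The helper below is a small sketch that assumes Info-ZIP's unzip is installed (its -Z1 option lists entry names one per line). The name bundle_contains is made up for this example.

```shell
#!/bin/sh
# Succeed if the archive at $1 contains an entry named exactly $2.
# Relies on the llamafile also being a valid ZIP archive.
bundle_contains() {
    unzip -Z1 "$1" 2>/dev/null | grep -Fxq -- "$2"
}
```

For example, bundle_contains my-custom-model.llamafile your-model.gguf && echo "weights embedded" confirms the model is inside, and the same check with .args confirms the default arguments are too.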
Offline-First
Llamafile has zero network dependencies at runtime. There are no telemetry calls, no model download at startup, no API keys. This makes it suitable for air-gapped environments, enterprise deployments with strict network policies, and personal use where privacy is a priority.
Llamafile vs Ollama
Ollama is better for model management: it has a model library, a pull command, and easy switching between models, and re-running pull fetches the updated version of a model. Llamafile is better for distribution: you give someone one file and they can run the model, with no Ollama installation and no model pull. For internal tools or sharing a specific model with non-technical users, llamafile's single-file nature is a major advantage.