Moondream2: A 1.9B VLM That Runs on a Raspberry Pi

Moondream2 is a 1.9B-parameter vision-language model that fits in about 1.2GB of RAM when quantized to 4 bits, enabling image captioning, visual Q&A, and object detection on embedded hardware and edge devices.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 2, 2026

7 min read

Why Size Matters for VLMs

Most vision-language models start at 7B parameters. LLaVA-7B needs about 14GB of VRAM for its weights alone in float16 (7B parameters × 2 bytes each). For applications that need vision understanding without the latency of a cloud API (embedded systems, mobile apps, privacy-sensitive processing), the model must fit in the RAM the device actually has.
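The 14GB figure is simple arithmetic: parameter count times bytes per parameter. A minimal sketch that estimates weight memory only (the function name is ours, not from any library):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough memory needed for model weights alone, in decimal GB (1 GB = 1e9 bytes).

    Ignores activations, the KV cache, quantization scales, and runtime
    overhead, so measured RAM usage will be somewhat higher.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, params in [("LLaVA-7B", 7e9), ("Moondream2", 1.9e9)]:
        for bits in (16, 4):
            print(f"{name:<10} {bits:>2}-bit: {weight_memory_gb(params, bits):.2f} GB")
```

At 4 bits, the weights alone come to about 3.5GB for LLaVA-7B and 0.95GB for Moondream2; the gap to the 4GB and 1.2GB figures quoted in this article is plausibly runtime overhead such as activations and the KV cache.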

Moondream2 achieves 1.9B parameters through architectural choices that sacrifice breadth for efficiency: a smaller vision encoder, aggressive weight sharing, and training data focused on the most common vision-language tasks.

Hardware Requirements

Quantized to 4 bits in GGUF format, Moondream2 needs about 1.2GB of RAM, little enough to run on a Raspberry Pi 5 (8GB model), a mid-range smartphone, or any laptop with integrated graphics. Speed varies by hardware:

M2 MacBook Pro (CPU): ~3 seconds per image caption
Raspberry Pi 5: ~15 seconds per image caption
RTX 3080 (GPU): ~0.5 seconds per image caption
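Per-image latency translates directly into throughput. A back-of-the-envelope sketch using the timings above, assuming images are processed one at a time with no batching or caching (the helper names are illustrative):

```python
# Seconds per caption, as quoted in the list above (single image, no batching).
SECONDS_PER_CAPTION = {
    "M2 MacBook Pro (CPU)": 3.0,
    "Raspberry Pi 5": 15.0,
    "RTX 3080 (GPU)": 0.5,
}


def captions_per_hour(seconds_per_caption: float) -> float:
    """Sequential throughput: one image at a time, back to back."""
    return 3600 / seconds_per_caption


def hours_for_images(num_images: int, seconds_per_caption: float) -> float:
    """Wall-clock hours to caption num_images sequentially."""
    return num_images * seconds_per_caption / 3600


if __name__ == "__main__":
    for device, secs in SECONDS_PER_CAPTION.items():
        print(f"{device}: {captions_per_hour(secs):.0f} captions/hour, "
              f"{hours_for_images(10_000, secs):.1f} h for 10,000 images")
```

At ~15 seconds per caption, a Raspberry Pi 5 handles about 240 images an hour, which suits periodic snapshots better than a live video stream.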

The Hugging Face Moondream2 page provides weights in GGUF, safetensors, and ONNX formats.

Python API

```python
from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code=True runs the model code shipped in the Hugging Face repo;
# pinning a revision keeps that code (and its API) fixed.
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    # device_map={"": "cuda"},  # uncomment to run on a GPU
)

image = Image.open("photo.jpg")

# Encode the image once and reuse the encoding across queries
enc_image = model.encode_image(image)

# Image captioning
caption = model.caption(enc_image, length="short")["caption"]
print(caption)

# Visual Q&A
answer = model.query(enc_image, "How many people are in the image?")["answer"]
print(answer)

# Object detection (returns normalized bounding boxes)
objects = model.detect(image, "person")["objects"]
print(objects)  # [{"x_min": 0.2, "y_min": 0.1, "x_max": 0.5, "y_max": 0.9}, ...]
```
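Detection boxes come back as fractions of the image width and height, as in the sample output above. To crop or draw them you need pixel coordinates. A small helper, assuming that normalized `x_min`/`y_min`/`x_max`/`y_max` layout (`to_pixel_box` is our own name, not part of the Moondream API):

```python
from typing import Dict, Tuple


def to_pixel_box(box: Dict[str, float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a box with normalized (0..1) corners to integer pixel corners.

    Values are clamped to the image so that slightly out-of-range
    coordinates still produce a valid rectangle.
    """
    def clamp(value: float, limit: int) -> int:
        return min(max(round(value * limit), 0), limit)

    return (
        clamp(box["x_min"], width),
        clamp(box["y_min"], height),
        clamp(box["x_max"], width),
        clamp(box["y_max"], height),
    )


if __name__ == "__main__":
    example = {"x_min": 0.2, "y_min": 0.1, "x_max": 0.5, "y_max": 0.9}
    print(to_pixel_box(example, width=640, height=480))  # (128, 48, 320, 432)
```

The returned tuple is in the (left, top, right, bottom) order that Pillow's `Image.crop(...)` and `ImageDraw.Draw(image).rectangle(...)` accept, with `width, height = image.size`.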

Moondream Server for Batch Inference

The Moondream GitHub repository includes a FastAPI server that batches image requests and caches vision encodings. For pipelines processing thousands of images, caching the encoded image representation (the vision encoder's output, before the text generation step) reduces compute by ~60% when the same image is queried with multiple questions.
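The caching idea is easy to reproduce in your own pipeline: key the expensive vision-encoding step by the image's content and reuse the result for every question about that image. A minimal sketch, not the server's implementation; `EncodingCache`, `encode_fn`, and `answer_fn` are hypothetical names, and in practice the two callables would wrap the model's image-encoding and question-answering calls:

```python
import hashlib
from typing import Any, Callable, Dict


class EncodingCache:
    """Memoize an expensive image-encoding step, keyed by a hash of the image bytes.

    encode_fn and answer_fn are hypothetical stand-ins for the model's
    vision encoder and its text-generation step.
    """

    def __init__(self, encode_fn: Callable[[bytes], Any],
                 answer_fn: Callable[[Any, str], str]) -> None:
        self._encode_fn = encode_fn
        self._answer_fn = answer_fn
        self._cache: Dict[str, Any] = {}

    def ask(self, image_bytes: bytes, question: str) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            # Vision encoder runs only the first time this image is seen
            self._cache[key] = self._encode_fn(image_bytes)
        # Text generation still runs once per question
        return self._answer_fn(self._cache[key], question)


if __name__ == "__main__":
    calls = {"encode": 0}

    def fake_encode(data: bytes) -> str:
        calls["encode"] += 1
        return f"embedding-of-{len(data)}-bytes"

    def fake_answer(encoding: str, question: str) -> str:
        return f"{question} -> ({encoding})"

    cache = EncodingCache(fake_encode, fake_answer)
    image = b"fake image bytes"
    for q in ["Describe this image.", "How many people are in the image?"]:
        print(cache.ask(image, q))
    print("encoder calls:", calls["encode"])  # encoder calls: 1
```

This cache is unbounded; for thousands of images, an LRU policy (for example an `OrderedDict` with a size cap) keeps memory in check. Hashing raw bytes also means a re-encoded or resized copy of the same photo counts as a new image.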

```bash
# Start the moondream server
pip install moondream
python -m moondream.server --model 2b-int8
```
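The server's other feature, request batching, groups concurrent requests into one model call. Its internals aren't covered here, so the following is a generic asyncio micro-batching sketch rather than the server's code: requests that arrive within a short window are collected and passed to a single batched call. `MicroBatcher` and `batch_fn` are hypothetical names; a real `batch_fn` would run the model on a list of images.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional


class MicroBatcher:
    """Group concurrent single requests into batches for one batched call.

    Create it inside a running event loop. batch_fn receives a list of
    inputs and must return a list of outputs in the same order.
    """

    def __init__(self, batch_fn: Callable[[List[Any]], Awaitable[List[Any]]],
                 max_batch: int = 8, max_wait: float = 0.05) -> None:
        self._batch_fn = batch_fn
        self._max_batch = max_batch
        self._max_wait = max_wait
        self._queue: asyncio.Queue = asyncio.Queue()
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, item: Any) -> Any:
        """Enqueue one request and wait for its individual result."""
        if self._worker is None:
            self._worker = asyncio.create_task(self._run())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def _run(self) -> None:
        while True:
            batch = [await self._queue.get()]  # wait for at least one request
            if self._queue.qsize() < self._max_batch - 1:
                # Give concurrent callers a short window to join this batch
                await asyncio.sleep(self._max_wait)
            while len(batch) < self._max_batch and not self._queue.empty():
                batch.append(self._queue.get_nowait())

            items = [item for item, _ in batch]
            try:
                results = await self._batch_fn(items)
            except Exception as exc:
                # Propagate the failure to every caller in this batch
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


async def _demo() -> None:
    async def fake_caption_batch(paths: List[str]) -> List[str]:
        # Hypothetical stand-in for one batched forward pass of a VLM
        print(f"running batch of {len(paths)}")
        return [f"caption for {p}" for p in paths]

    batcher = MicroBatcher(fake_caption_batch, max_batch=4, max_wait=0.01)
    captions = await asyncio.gather(*(batcher.submit(f"img_{i}.jpg") for i in range(10)))
    print(captions[:2])


if __name__ == "__main__":
    asyncio.run(_demo())
```

`max_wait` trades latency for throughput: a longer window fills larger batches but delays a request by up to that long. This sketch waits the full window whenever fewer than `max_batch` requests are already queued.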

Comparison to LLaVA-7B

| Metric | Moondream2 | LLaVA-7B |
|---|---|---|
| Parameters | 1.9B | 7B |
| RAM (4-bit) | 1.2GB | 4GB |
| Image captioning quality | Good | Better |
| Object detection | Built-in | Requires prompt tuning |
| Edge deployment | Yes | No (too slow) |
| VQA accuracy (VQAv2) | ~74% | ~80% |
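Expressing the table's numbers as ratios makes the trade-off concrete. A tiny sketch using only the approximate figures quoted in the table (the dictionary keys and function name are illustrative):

```python
# Approximate figures from the comparison table
MOONDREAM2 = {"parameters (B)": 1.9, "RAM at 4-bit (GB)": 1.2, "VQAv2 accuracy (%)": 74.0}
LLAVA_7B = {"parameters (B)": 7.0, "RAM at 4-bit (GB)": 4.0, "VQAv2 accuracy (%)": 80.0}


def relative_to(small: dict, large: dict) -> dict:
    """Each metric of `small` as a fraction of the same metric of `large`."""
    return {metric: small[metric] / large[metric] for metric in small}


if __name__ == "__main__":
    for metric, ratio in relative_to(MOONDREAM2, LLAVA_7B).items():
        print(f"{metric}: Moondream2 is {ratio:.1%} of LLaVA-7B")
```

By these figures, Moondream2 uses roughly 27% of LLaVA-7B's parameters and 30% of its 4-bit memory while reaching about 92.5% of its VQAv2 accuracy (74/80).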

For edge deployments where LLaVA-7B is impractical, Moondream2 captures most of the value at a fraction of the resource cost. For server-side inference where quality is the priority, LLaVA-7B or larger models are preferable.
