MoE Comes to Vision-Language Models
Mixture of Experts (MoE) architectures have reshaped the LLM landscape by allowing massive total parameter counts with modest active parameter budgets. DeepSeek-VL2 brings this approach to vision-language modeling: 27B total parameters, but only 4.5B are active per token, making inference memory and compute closer to a 5B dense model.
Architecture Overview
DeepSeek-VL2 uses:
- Vision encoder: SigLIP-400M, processing images via dynamic tiling
- MoE language backbone: DeepSeek-V2 MoE architecture — 27B total parameters, 4.5B active via top-K expert routing
- Modality bridge: MLP projector connecting vision encoder outputs to language model inputs
The dynamic tiling for vision handles arbitrary image resolutions by splitting images into tiles matched to the vision encoder's native input size, then encoding each tile independently before concatenating token sequences.
import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images
model_path = "deepseek-ai/deepseek-vl2"
processor = DeepseekVLV2Processor.from_pretrained(model_path)
model = DeepseekVLV2ForCausalLM.from_pretrained(
model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()
conversation = [
{
"role": "User",
"content": "<image>
Extract all text visible in this document and format it as markdown.",
"images": ["document.jpg"],
},
{"role": "Assistant", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)
with torch.no_grad():
outputs = model.generate(**prepare_inputs, max_new_tokens=512)
answer = processor.tokenizer.decode(outputs[0][prepare_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
DeepSeek-VL2 Sizes
Three variants cover different deployment scenarios:
- DeepSeek-VL2-Tiny: 3B total, ~1B active — for edge/constrained deployment
- DeepSeek-VL2-Small: 16B total, 2.8B active — balanced quality/efficiency
- DeepSeek-VL2: 27B total, 4.5B active — highest quality, single A100 80GB
Task Performance
DeepSeek-VL2 excels at OCR-heavy tasks: DocVQA, ChartQA, InfoVQA, and TextVQA. The dynamic tiling preserves fine text in documents better than fixed-resolution models. On MMMU it is competitive with InternVL2-26B despite lower active parameter count.
Comparison to InternVL2
Both models handle high-resolution documents well. InternVL2-26B has a larger active parameter count (26B dense vs 4.5B active for DeepSeek-VL2) which shows on complex reasoning tasks. DeepSeek-VL2 wins on inference throughput at equivalent GPU memory budgets due to MoE sparsity.
DeepSeek Platform API Access
For teams not running their own inference infrastructure, DeepSeek provides API access to VL2 at competitive pricing — making it practical to test before committing to self-hosted deployment.