DeepSeek-VL2: Efficient Vision-Language Model With Mixture of Experts

DeepSeek-VL2 applies Mixture of Experts to vision-language modeling, activating only 4.5B of 27B parameters per forward pass while matching models twice its active size.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 6, 2026

7 min read

// tags

#deepseek-vl2#vision#moe#deepseek#multimodal

FIG. ART-25

7 min read

“

DeepSeek-VL2: Efficient Vision-Language Model With Mixture of Experts

// reading plan

sections

363

words

min read

// LLMs & Language Models

DeepSeek-R1: Architectures, Training Methods, and Why Reasoning Models Matter

An in-depth look at reinforcement learning, Chain-of-Thought reasoning, and why DeepSeek-R1 represents a shift in LLM capabilities and cost.

12 min read

// LLMs & Language Models

Local LLMs in 2026: Comparing Llama 3.3, Mistral Large, and DeepSeek-R1

MoE Comes to Vision-Language Models

Mixture of Experts (MoE) architectures have reshaped the LLM landscape by allowing massive total parameter counts with modest active parameter budgets. DeepSeek-VL2 brings this approach to vision-language modeling: 27B total parameters, but only 4.5B are active per token, making inference memory and compute closer to a 5B dense model.

Architecture Overview

DeepSeek-VL2 uses:

Vision encoder: SigLIP-400M, processing images via dynamic tiling
MoE language backbone: DeepSeek-V2 MoE architecture - 27B total parameters, 4.5B active via top-K expert routing
Modality bridge: MLP projector connecting vision encoder outputs to language model inputs

The dynamic tiling for vision handles arbitrary image resolutions by splitting images into tiles matched to the vision encoder's native input size, then encoding each tile independently before concatenating token sequences.

import torch
from transformers import AutoModelForCausalLM
from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

model_path = "deepseek-ai/deepseek-vl2"
processor = DeepseekVLV2Processor.from_pretrained(model_path)
model = DeepseekVLV2ForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).cuda().eval()

conversation = [
    {
        "role": "User",
        "content": "<image>
Extract all text visible in this document and format it as markdown.",
        "images": ["document.jpg"],
    },
    {"role": "Assistant", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = processor(conversations=conversation, images=pil_images, force_batchify=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**prepare_inputs, max_new_tokens=512)

answer = processor.tokenizer.decode(outputs[0][prepare_inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)

DeepSeek-VL2: Efficient Vision-Language Model With Mixture of Experts

Related Articles

DeepSeek-R1: Architectures, Training Methods, and Why Reasoning Models Matter

MoE Comes to Vision-Language Models

Architecture Overview

DeepSeek-VL2 Sizes

Task Performance

Comparison to InternVL2

DeepSeek Platform API Access

Links

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Local LLMs in 2026: Comparing Llama 3.3, Mistral Large, and DeepSeek-R1

ONNX: Export Any ML Model and Run It Anywhere

DeepSeek-VL2: Efficient Vision-Language Model With Mixture of Experts

Related Articles

DeepSeek-R1: Architectures, Training Methods, and Why Reasoning Models Matter

MoE Comes to Vision-Language Models

Architecture Overview

DeepSeek-VL2 Sizes

Task Performance

Comparison to InternVL2

DeepSeek Platform API Access

Links

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Local LLMs in 2026: Comparing Llama 3.3, Mistral Large, and DeepSeek-R1

ONNX: Export Any ML Model and Run It Anywhere

The workspace your team
actually needs