Closing the Gap With GPT-4V
InternVL 2 from Shanghai AI Lab is the most competitive open-source VLM family on multimodal benchmarks as of mid-2024. The 26B variant scores 61.2% on MMMU (Massive Multidisciplinary Multimodal Understanding), compared to GPT-4V's 63.1% — a gap small enough that for most practical tasks, the open model is the rational choice.
Architecture: InternViT-6B + InternLM2
InternVL2 uses a purpose-built vision encoder: InternViT-6B, trained with contrastive learning on large-scale image-text pairs. Most competing models use SigLIP or CLIP encoders in the 300M–400M parameter range. A 6B vision encoder captures substantially more visual detail and transfers better to complex scenes, dense text, and technical diagrams.
The language backbone is InternLM2, available in 2B, 7B, and 20B variants, giving the full model a range of sizes from 2B (InternVL2-2B) to 76B (InternVL2-76B with InternLM2-70B).
Dynamic High-Resolution Tiling
InternVL2 processes high-resolution images by dynamically splitting them into tiles of up to 448×448 pixels each. A 4K image can be represented with up to 40 tiles, preserving fine details in dense text, charts, and technical schematics without resizing artifacts.
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image
model = AutoModel.from_pretrained(
"OpenGVLab/InternVL2-26B",
torch_dtype=torch.bfloat16,
trust_remote_code=True,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("OpenGVLab/InternVL2-26B", trust_remote_code=True)
image = Image.open("technical_diagram.png")
question = "<image>
Describe all the components and their connections in this diagram."
response = model.chat(tokenizer, image, question, generation_config={"max_new_tokens": 512})
print(response)
Benchmark Results Across Sizes
| Model | MMMU | DocVQA | ChartQA | |-------|------|--------|---------| | InternVL2-2B | 36.3% | 86.9% | 76.2% | | InternVL2-8B | 51.2% | 91.6% | 83.3% | | InternVL2-26B | 61.2% | 92.9% | 87.2% | | GPT-4V | 63.1% | 88.4% | 78.5% |
Note that InternVL2 outperforms GPT-4V on DocVQA and ChartQA while being within 2 points on MMMU.
Production Deployment With LMDeploy
For high-throughput serving, LMDeploy provides an optimized backend for InternVL2:
pip install lmdeploy
lmdeploy serve api_server OpenGVLab/InternVL2-26B --tp 2 --port 8080
This enables tensor-parallel serving across multiple GPUs with an OpenAI-compatible API.
Choosing a Size
InternVL2-8B fits on a single A100 40GB and covers most document/chart tasks adequately. InternVL2-26B is worth the additional GPU memory for scientific paper understanding, dense OCR, and math-heavy visuals. The 76B variant is for research labs with multi-GPU infrastructure.