CogVLM2: Open-Source Video and Image Understanding With Long Context

Zhipu AI's CogVLM2 introduces a Visual Expert Module that gives visual tokens their own weight matrices, enabling richer image and video understanding than shared-weight alternatives.

Mahmudul Haque Qudrati

CEO & ML Engineer

April 4, 2026

7 min read

// tags

#cogvlm2 #video-understanding #zhipu-ai #vlm #multimodal

Beyond Shared Attention Weights

Most vision-language models process visual and text tokens through the same attention and FFN layers. CogVLM2 takes a different approach: in each transformer layer, a Visual Expert Module adds a separate set of QKV projection weights and FFN weights used only for visual tokens. Text tokens keep using the language model's original weights, while visual tokens are projected by the expert weights. Attention is then computed jointly over the full sequence, so the two modalities still attend to each other. This gives the model dedicated capacity for visual reasoning without changing how text is processed.
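
The routing idea can be illustrated with a small, dependency-free sketch. It is not CogVLM2's implementation: the token-type ids, the function names, and the toy 2×2 matrices are all hypothetical, and the sketch covers only the per-token choice of projection weights, not attention or the FFN.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

# Hypothetical token-type ids for this sketch (not taken from the article).
TEXT, VISION = 0, 1


def matvec(weights: Matrix, x: Vector) -> Vector:
    """Multiply a weight matrix by one token's hidden vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def routed_projection(
    hidden: List[Vector],
    token_types: List[int],
    w_text: Matrix,
    w_vision: Matrix,
) -> List[Vector]:
    """Project each token with the weight matrix that matches its type."""
    if len(hidden) != len(token_types):
        raise ValueError("hidden and token_types must have the same length")
    projected = []
    for x, token_type in zip(hidden, token_types):
        weights = w_vision if token_type == VISION else w_text
        projected.append(matvec(weights, x))
    return projected


# Toy weights: text tokens pass through unchanged, visual tokens are doubled,
# which makes it easy to see which matrix handled which token.
w_text = [[1.0, 0.0], [0.0, 1.0]]
w_vision = [[2.0, 0.0], [0.0, 2.0]]

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
token_types = [VISION, VISION, TEXT]

projected = routed_projection(hidden, token_types, w_text, w_vision)
print(projected)
```

In a real layer the same selection would apply to the query, key, and value projections and to the FFN, with the projected tokens recombined in their original order before attention runs over the whole sequence.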

CogVLM2 Image: Resolution and Architecture

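CogVLM2-Image processes images at up to 1344×1344 pixels, among the highest native resolutions for an open model of its size. The checkpoint used in the example that follows, cogvlm2-llama3-chat-19B, has roughly 19B parameters in total, as its name indicates: a Llama 3 8B language backbone extended with the visual expert weights, plus an EVA2-CLIP-E vision encoder that converts the image into visual tokens.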
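
To get a feel for what a 1344×1344 input means for sequence length, here is a back-of-the-envelope sketch. The patch size of 14 and the 2×2 downsampling factor are assumptions made for illustration and are not stated in this article; the function name is hypothetical.

```python
def visual_token_count(image_size: int, patch_size: int, downsample: int) -> int:
    """Count visual tokens for a square image split into square patches.

    The patch grid is reduced by `downsample` along each side before the
    tokens reach the language model.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    if patches_per_side % downsample != 0:
        raise ValueError("patch grid must be divisible by downsample")
    tokens_per_side = patches_per_side // downsample
    return tokens_per_side * tokens_per_side


# Assumed values: 14-pixel patches and 2x2 downsampling.
raw_patches = visual_token_count(1344, patch_size=14, downsample=1)
llm_tokens = visual_token_count(1344, patch_size=14, downsample=2)
print(raw_patches, llm_tokens)
```

Under these assumptions, downsampling cuts the number of visual tokens by a factor of four, which is what keeps a high-resolution image affordable inside the language model's context.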

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: the model class ships with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("THUDM/cogvlm2-llama3-chat-19B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm2-llama3-chat-19B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
query = "What UI components are visible in this screenshot and what do they do?"

# Build the prompt and preprocess the image using the model's own helper.
input_by_model = model.build_conversation_input_ids(
    tokenizer, query=query, images=[image], template_version="chat"
)
# Add a batch dimension and move every tensor to the model's device.
# token_type_ids marks which positions hold image tokens and which hold text.
inputs = {
    "input_ids": input_by_model["input_ids"].unsqueeze(0).to(model.device),
    "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to(model.device),
    "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to(model.device),
    "images": [[input_by_model["images"][0].to(model.device).to(torch.bfloat16)]],
}
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512)
# Drop the prompt tokens so that only the newly generated answer is decoded.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
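
Before loading the checkpoint, it can help to estimate how much memory the weights alone will take. The sketch below is simple arithmetic, not a measurement: the 19e9 parameter count is an assumption read off the checkpoint name, the helper name is hypothetical, and activations, the KV cache, and framework overhead are not included.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights only, in GiB."""
    return num_params * bytes_per_param / 2**30


ASSUMED_PARAMS = 19e9  # assumption: taken from the "19B" in the checkpoint name

bf16_gib = weight_memory_gib(ASSUMED_PARAMS, bytes_per_param=2)  # torch.bfloat16
fp32_gib = weight_memory_gib(ASSUMED_PARAMS, bytes_per_param=4)  # torch.float32

print(f"bfloat16 weights: ~{bf16_gib:.1f} GiB")
print(f"float32 weights:  ~{fp32_gib:.1f} GiB")
```

Actual usage during generation will be higher than the weight-only figure, so treat the result as a lower bound when deciding whether device_map="auto" needs to spread the model across several devices.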
