Beyond Shared Attention Weights
Most vision-language models process visual and text tokens through the same attention and FFN layers. CogVLM2 takes a different approach: a Visual Expert Module adds a separate set of QKV projection weights and FFN weights used only for visual tokens. Text tokens pass through the original language-model weights, visual tokens pass through the expert weights, and attention is then computed jointly over the full sequence. This gives the model dedicated capacity for visual reasoning without changing how it handles text.
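The routing idea can be illustrated with a small NumPy sketch. It is not the CogVLM2 implementation: the `TEXT`/`VISION` labels, the `routed_projection` helper, and the toy dimensions are all hypothetical, and it assumes each token takes exactly one path (expert weights for visual tokens, language-model weights for text tokens).

```python
import numpy as np

TEXT, VISION = 0, 1  # hypothetical token-type labels for this sketch

def routed_projection(hidden, token_type_ids, w_text, w_vision):
    """Project each token with the weight matrix that matches its type.

    hidden:         (seq_len, d_model) token states
    token_type_ids: (seq_len,) array of TEXT / VISION labels
    w_text:         (d_model, d_out) language-model weights
    w_vision:       (d_model, d_out) visual-expert weights
    """
    is_vision = token_type_ids == VISION
    out = np.empty((hidden.shape[0], w_text.shape[1]), dtype=hidden.dtype)
    # Text tokens use the language-model projection.
    out[~is_vision] = hidden[~is_vision] @ w_text
    # Visual tokens use the separate expert projection.
    out[is_vision] = hidden[is_vision] @ w_vision
    return out

rng = np.random.default_rng(0)
d_model = 8
hidden = rng.standard_normal((5, d_model))
token_type_ids = np.array([VISION, VISION, VISION, TEXT, TEXT])
w_text = rng.standard_normal((d_model, d_model))
w_vision = rng.standard_normal((d_model, d_model))

# The same routing would apply to the query, key, and value projections.
queries = routed_projection(hidden, token_type_ids, w_text, w_vision)
print(queries.shape)
```

Because the routed outputs land back in one sequence, attention can still mix visual and text positions; only the projections differ by token type.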
CogVLM2-Image: Resolution and Architecture
CogVLM2-Image processes images at up to 1344×1344 pixels, among the highest native resolutions for an open model in this size class. The released chat checkpoint, cogvlm2-llama3-chat-19B, has about 19B parameters in total: a Llama 3 8B language backbone, the visual expert weights added on top of it, and an EVA2-CLIP-E vision encoder that turns the image into visual tokens.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"

    # trust_remote_code=True runs the model's custom code from the Hub repository.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )

    image = Image.open("screenshot.png").convert("RGB")
    query = "What UI components are visible in this screenshot and what do they do?"

    # build_conversation_input_ids is a helper defined in the checkpoint's remote code.
    input_by_model = model.build_conversation_input_ids(
        tokenizer, query=query, images=[image], template_version="chat"
    )

    # Add a batch dimension and move every tensor to the model's device.
    inputs = {
        "input_ids": input_by_model["input_ids"].unsqueeze(0).to(model.device),
        "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to(model.device),
        "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to(model.device),
        "images": [[input_by_model["images"][0].to(model.device).to(torch.bfloat16)]],
    }

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=512)

    # Drop the prompt tokens so that only the generated answer is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    print(response)
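Before loading a checkpoint of this size, it helps to estimate how much memory the weights alone will occupy. The sketch below is simple arithmetic rather than a measurement: it takes the parameter count from the "19B" in the checkpoint name (an approximation), and it ignores activations, the KV cache, and framework overhead. The `weight_memory_gib` helper is hypothetical.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def weight_memory_gib(num_params, dtype):
    """Approximate memory needed to hold the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

NUM_PARAMS = 19e9  # approximate, read off the "19B" in the checkpoint name

for dtype in ("float32", "bfloat16"):
    print(f"{dtype:>8}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

In bfloat16 the weights come to roughly 35 GiB, so actual inference needs more than that once activations and image tokens are included.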
CogVLM2-Video: Temporal Understanding
CogVLM2-Video extends the architecture to handle video input by sampling frames at regular intervals and embedding them as a sequence of visual token groups. The model attends across all frames simultaneously, enabling temporal reasoning: tracking object motion, identifying state changes, understanding cause-and-effect across frames.
Input can be a video file (MP4, AVI), loaded with the video utilities in the CogVLM2 repository, or a tensor of manually sampled frames.
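Sampling at regular intervals can be sketched as picking the middle frame of each equal-length segment of the clip. This is one common way to do it, not necessarily the strategy used by the repository's loader; the `uniform_frame_indices` helper and the choice of 24 samples are assumptions made for illustration.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return up to num_samples frame indices spread evenly over the clip."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Short clip: keep every frame rather than repeating any.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]

# Example: a 300-frame clip reduced to 24 sampled frames (hypothetical values).
indices = uniform_frame_indices(total_frames=300, num_samples=24)
print(indices)
```

The resulting indices can be passed to whichever decoder is in use to fetch the frames, which are then stacked into the frame tensor the model expects.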
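GLM-4 Language Backbone
The released CogVLM2-Video checkpoints (for example cogvlm2-video-llama3-chat) are built on Llama 3, like the image model. A GLM-4 backbone is used by the related GLM-4V-9B model from the same group, which offers stronger Chinese language capability alongside English, an advantage for teams building multilingual visual applications.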
Benchmark Comparisons
On OCRBench and document understanding tasks, CogVLM2-Image is reported to score close to InternVL2-8B and LLaVA-1.6-34B, and the Visual Expert Module is credited with an advantage on tasks that require fine-grained visual attribute recognition. On video tasks, CogVLM2-Video is reported to outperform other open models on the EgoSchema and VideoChatGPT benchmarks.
Practical Use Cases
The high image resolution makes CogVLM2 well suited to tasks where small visual details matter: UI/UX feedback from screenshots, medical image annotation, technical diagram interpretation, and attribute extraction from retail product images.