LLaVA Architecture
LLaVA connects a CLIP vision encoder to a large language model (Vicuna, Mistral, or Yi, depending on the variant) through a simple projection layer: a single linear layer in the original LLaVA and a two-layer MLP from LLaVA 1.5 onward. The vision encoder extracts image features, the projection maps them into the LLM's embedding space, and the LLM generates text autoregressively, conditioned on both the visual tokens and the text tokens.
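To make the data flow concrete, here is a shape-only sketch of the pipeline described above. It assumes a CLIP ViT-L/14 encoder at 336x336 (hidden size 1024) and a 7B-class LLM with hidden size 4096; the class and function names are illustrative and do not come from the LLaVA codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectorShapes:
    """Assumed dimensions for a CLIP ViT-L/14-336 encoder and a 7B LLM."""
    image_size: int = 336   # input resolution of the vision encoder
    patch_size: int = 14    # ViT patch edge in pixels
    vision_dim: int = 1024  # hidden size of CLIP ViT-L
    llm_dim: int = 4096     # hidden size of a 7B Vicuna/Mistral model

    def num_visual_tokens(self) -> int:
        # One token per image patch (the CLS token is not counted here).
        per_side = self.image_size // self.patch_size
        return per_side * per_side


def sequence_shapes(cfg: ConnectorShapes, num_text_tokens: int) -> dict:
    """Trace tensor shapes from vision features to the LLM input sequence."""
    n_vis = cfg.num_visual_tokens()
    return {
        "vision_features": (n_vis, cfg.vision_dim),       # encoder output
        "projected_visual_tokens": (n_vis, cfg.llm_dim),  # after projection
        "text_embeddings": (num_text_tokens, cfg.llm_dim),
        "llm_input": (n_vis + num_text_tokens, cfg.llm_dim),
    }


shapes = sequence_shapes(ConnectorShapes(), num_text_tokens=32)
for name, shape in shapes.items():
    print(f"{name}: {shape}")
```

The point of the sketch is that the projection changes only the feature width (1024 to 4096 under these assumptions); the number of visual tokens is fixed by the encoder's patch grid.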
The LLaVA project page documents each architecture iteration. The key insight from the original paper: instruction tuning with GPT-4-generated (image, question, answer) triples dramatically improves visual instruction following, even with a simple connection mechanism.
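As an illustration of what one such training example might look like on disk, the sketch below wraps an (image, question, answer) triple in a conversation-style record. The field names (`id`, `image`, `conversations`, `from`, `value`) and the `<image>` placeholder are assumptions modeled on conversation-style instruction data; check the released dataset for the exact schema. The sample values are made up.

```python
import json


def to_instruction_record(record_id, image_file, question, answer):
    """Wrap one (image, question, answer) triple as a two-turn conversation."""
    return {
        "id": record_id,
        "image": image_file,
        "conversations": [
            # The placeholder marks where the visual tokens are spliced in.
            {"from": "human", "value": "<image>\n" + question},
            {"from": "gpt", "value": answer},
        ],
    }


# Made-up example values, for illustration only.
record = to_instruction_record(
    "example-0001",
    "example-0001.jpg",
    "What color is the bus?",
    "The bus is white and red.",
)
print(json.dumps(record, indent=2))
```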
LLaVA 1.6 Improvements (Dynamic High Resolution)
The LLaVA 1.6 (LLaVA-NeXT) release introduces dynamic high-resolution processing. LLaVA 1.5 resized every image to 336x336 pixels before encoding, which lost fine-grained text, small objects, and chart details. LLaVA 1.6 instead:
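- Determines the best grid layout for the input image's aspect ratio (e.g., 2x2 for a roughly square image, or a single row of tiles for a very wide one)
- Resizes the image to fit that grid and splits it into 336x336 tiles, the vision encoder's native input resolution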
- Encodes each tile separately with CLIP
- Concatenates tile tokens with a downsampled global view
This increase of up to 4x in input pixels (for example, a 672x672 grid instead of a single 336x336 image) explains most of the benchmark improvement over LLaVA 1.5.
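The steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: the candidate canvas sizes, the scoring rule (maximize preserved pixels, then minimize padding), and all function names are assumptions. The token count is an upper bound that assumes 576 tokens per encoded view and ignores any padding removal or separator tokens.

```python
TILE = 336             # edge of one vision-encoder input tile, in pixels
TOKENS_PER_VIEW = 576  # (336 // 14) ** 2 patch tokens per encoded view

# Assumed candidate canvases as (width, height), all multiples of TILE.
CANDIDATE_CANVASES = [(672, 672), (672, 336), (336, 672), (1008, 336), (336, 1008)]


def fitted_size(width, height, canvas):
    """Size of the image after scaling it to fit inside the canvas."""
    cw, ch = canvas
    if cw * height <= ch * width:  # width is the limiting side
        return cw, height * cw // width
    return width * ch // height, ch


def select_canvas(width, height, candidates=CANDIDATE_CANVASES):
    """Pick the canvas that keeps the most image pixels with the least padding."""
    def score(canvas):
        fit_w, fit_h = fitted_size(width, height, canvas)
        effective = min(fit_w * fit_h, width * height)  # upscaling adds no detail
        wasted = canvas[0] * canvas[1] - effective
        return (effective, -wasted)
    return max(candidates, key=score)


def plan_views(width, height):
    """Describe the tiles and token budget for one input image."""
    canvas = select_canvas(width, height)
    cols, rows = canvas[0] // TILE, canvas[1] // TILE
    tiles = [
        (c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE)
        for r in range(rows)
        for c in range(cols)
    ]
    views = len(tiles) + 1  # all tiles plus one downsampled global view
    return {
        "grid": (cols, rows),
        "tile_boxes": tiles,  # (left, upper, right, lower) per tile
        "pixel_ratio": canvas[0] * canvas[1] / (TILE * TILE),
        "max_visual_tokens": views * TOKENS_PER_VIEW,
    }


square = plan_views(1000, 1000)
wide = plan_views(2016, 672)
print(square["grid"], square["pixel_ratio"], square["max_visual_tokens"])
print(wide["grid"], wide["pixel_ratio"], wide["max_visual_tokens"])
```

Under these assumptions a square image lands on the 2x2 grid, which is where the 4x pixel figure comes from, while a 3:1 image gets a single row of three tiles. Each box in `tile_boxes` uses the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the plan could drive the actual cropping once the image has been resized and padded to the chosen canvas.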
Loading With Transformers
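from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image

# Load the processor (image preprocessing + tokenizer) and the model weights.
processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,  # half precision to reduce memory use
    device_map="auto",          # let Accelerate place the weights
)

image = Image.open("chart.png")

# One user turn containing an image placeholder followed by the question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the highest value shown in this chart?"},
        ],
    },
]

# Render the conversation with the model's chat template.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Move inputs to wherever device_map placed the model instead of hardcoding "cuda".
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200)

# Note: the decoded string contains the prompt followed by the generated answer.
print(processor.decode(output[0], skip_special_tokens=True))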
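Because `generate` on a decoder-only model returns the prompt tokens followed by the newly generated ones, the decoded string above includes the prompt. A small helper can isolate the answer; `generated_only` is a hypothetical name, and the sketch assumes the output sequence supports `len()` and slicing, which holds for both Python lists and 1-D tensors.

```python
def generated_only(output_ids, prompt_length):
    """Return only the token ids produced after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]


# Stand-in ids for illustration: three prompt tokens, two generated tokens.
demo_output = [101, 102, 103, 7, 8]
print(generated_only(demo_output, prompt_length=3))
```

With the loading example, this would be called as `generated_only(output[0], inputs["input_ids"].shape[1])` before passing the result to `processor.decode`.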
MMBench Score vs GPT-4V
On MMBench (a visual understanding benchmark of roughly 3,000 multiple-choice questions), LLaVA-1.6-34B scores within 5 points of GPT-4V across most sub-categories. The Hugging Face model page links to full evaluation results.
Practical Uses and Variants
Document parsing: Extract structured data from invoices, forms, and tables without OCR APIs.
Chart Q&A: Answer questions about data visualizations without manual data entry.
Visual code review: Analyze UI screenshots and suggest improvements.
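For the document parsing use case, one workable pattern is to ask for a JSON object and parse the reply defensively, since the model may wrap the JSON in extra prose. The sketch below is an assumption about how one might structure this; the function names, prompt wording, and sample reply are hypothetical, and the conversation layout mirrors the loading example.

```python
import json


def build_extraction_conversation(fields):
    """Build a single-turn conversation asking for the given fields as JSON."""
    instruction = (
        "Extract the following fields from this document and reply with a "
        "single JSON object using exactly these keys: "
        + ", ".join(fields)
        + ". Use null for any field that is not visible."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]


def parse_json_reply(reply, fields):
    """Pull the outermost JSON object out of a reply; None if that fails."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    # Keep only the requested keys; missing ones become None.
    return {name: data.get(name) for name in fields}


fields = ["invoice_number", "total"]
conversation = build_extraction_conversation(fields)
# Made-up model reply, for illustration only.
sample_reply = 'Here is the data: {"invoice_number": "A-17", "total": 42.5}'
print(parse_json_reply(sample_reply, fields))
```

Returning `None` on malformed output lets the caller retry or fall back, rather than passing a partially parsed record downstream.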
LLaVA-NeXT variants span 7B (Mistral backbone), 13B (Vicuna backbone), and 34B (Yi backbone). In float16, the 7B variant fits in about 16GB of VRAM; the 34B requires an 80GB GPU or a multi-GPU setup. For most document and chart tasks, the 7B delivers adequate accuracy at practical inference cost.
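The VRAM figures follow from a back-of-the-envelope estimate: parameter count times bytes per parameter. The sketch below counts LLM weights only, using nominal parameter counts and decimal gigabytes; it ignores the vision encoder, activations, and the KV cache, so treat the results as lower bounds. The function names and the 4-bit row are illustrative assumptions.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight memory in GB (billions of params x bytes per param)."""
    return params_billion * bytes_per_param


def fits(params_billion, bytes_per_param, vram_gb):
    """True if the weights alone fit in the given VRAM budget."""
    return weight_memory_gb(params_billion, bytes_per_param) <= vram_gb


for name, size in [("7B", 7), ("13B", 13), ("34B", 34)]:
    fp16 = weight_memory_gb(size, 2)    # float16: 2 bytes per parameter
    int4 = weight_memory_gb(size, 0.5)  # 4-bit quantization: 0.5 bytes
    print(f"{name}: ~{fp16:.0f} GB in float16, ~{int4:.1f} GB at 4-bit")
```

Under these assumptions the 7B weights take roughly 14 GB in float16, which is why a 16GB card is a tight but workable fit, and the 34B weights take roughly 68 GB, which exceeds any single GPU below the 80GB class.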