A multimodal LLM is a model that processes multiple input types — text, images, audio, video, or documents — in a single model without separate processing pipelines. GPT-4o handles text, images, and audio. Claude 3.5 Sonnet handles text, images, and PDFs. Gemini 1.5 Pro adds native video understanding. Using multimodal models simplifies architecture and enables reasoning that crosses modalities.
Current Capabilities by Model
GPT-4o
GPT-4o (the "o" stands for "omni") is OpenAI's natively multimodal flagship. Inputs it accepts:
- Text (obviously)
- Images: screenshots, photos, diagrams, charts, handwriting
- Audio: real-time voice conversation with notably natural latency
GPT-4o's vision is strong for charts, code screenshots, UI mockups, and handwritten text. Its audio mode is the most natural-sounding and low-latency voice interface available in any commercial API.
Limitation: GPT-4o does not accept video files. For video, you would need to extract frames and submit as images.
Claude 3.5 Sonnet and Opus
Anthropic's models accept text, images, and PDF documents. The PDF support is a meaningful differentiator: you can upload a PDF directly (up to 100 pages) and Claude reads both the text and embedded images natively, without external OCR or text extraction.
Claude 3.5 Sonnet is particularly strong at detailed image analysis and following complex visual instructions. Its vision quality for technical diagrams and dense charts is competitive with GPT-4o.
Gemini 1.5 Pro
Gemini 1.5 Pro is the most comprehensive in terms of input modalities:
- Text
- Images
- Audio files (not real-time voice like GPT-4o, but audio file input)
- Video files (native video understanding, not just frame extraction)
- Documents
The 1M token context window applies across all modalities. You can submit an hour of video (as frames plus audio) and ask questions about specific moments. This is a capability that GPT-4o and Claude do not match.
Llava (Open Source)
For self-hosted multimodal inference, Llava is the leading open source option. Llava-1.6 (also called LLaVA-NeXT) combines a vision encoder with Llama or Mistral as the language backbone.
ollama run llava
Quality is behind GPT-4o and Claude 3.5 for complex vision tasks, but it is free, open source, and runs locally. For simpler image understanding tasks (object identification, basic document reading, scene description), it is viable.
Common Use Cases
Analyzing Screenshots
Users submit screenshots of dashboards, error messages, or UI bugs and ask for analysis. GPT-4o and Claude 3.5 are both strong here. A screenshot of an error trace can be directly submitted for debugging assistance without copying text manually.
Reading Charts and Graphs
Models can extract values from bar charts, identify trends in line graphs, and describe what data visualizations show. Performance is solid for standard chart types. Complex or cluttered visualizations are harder. For high-stakes data extraction from charts, always verify with the underlying data when available.
Transcribing and Understanding Audio
Gemini 1.5 Pro accepts audio files and can transcribe and reason about the content simultaneously. GPT-4o's real-time audio mode is designed for conversation, not file transcription. For meeting transcription and analysis, specialized tools (Whisper for transcription, then LLM for analysis) often work better than a single multimodal call.
Processing Documents With Images
Technical manuals, research papers, and financial reports often contain a mix of text, tables, and figures. Claude's PDF support and Gemini's document understanding handle these holistically. For a paper with embedded charts, you can ask "What does Figure 3 show and how does it support the authors' conclusion?" without separately extracting the figure.
Limitations
Visual Hallucinations
Multimodal models sometimes confidently describe things that are not in an image or misread text from images. For document processing where accuracy is critical (contracts, medical records), always verify extracted information. The models are far better than they were two years ago, but they still make visual errors.
Degraded Text Accuracy in Images
Reading small text, especially in complex layouts with multiple fonts and sizes, is still imperfect. OCR-specialized tools (Google Vision API, AWS Textract) outperform general-purpose multimodal LLMs on pure text extraction accuracy from images.
Audio vs Video Reliability
Real-time audio conversation (GPT-4o voice mode) works well for clear speech in quiet environments. For accented speech, multiple speakers, or noisy audio, accuracy degrades. Purpose-built transcription models (Whisper, Deepgram) typically outperform multimodal LLMs for pure transcription accuracy.
Cost Considerations
Multimodal inputs cost more than text-only:
- GPT-4o image input: images are resized and converted to tokens. A 512x512 image costs roughly 170 tokens in high-detail mode.
- Gemini 1.5 Pro video: charged per second of video, at higher rates than text
- Audio input: generally charged per second or per audio token
For applications processing high volumes of images or audio, the cost can become significant. Consider whether you need the holistic reasoning of a multimodal model or whether OCR plus text-only LLM is sufficient for your use case.
Keep Reading
- Gemini 1.5 Pro vs GPT-4o Comparison — Deep dive into the two leading multimodal models
- LLM Embeddings Explained — How to build efficient retrieval over multimodal content
- How Large Language Models Work: Complete Guide — The architecture behind multimodal understanding
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.