Multimodal LLMs: Working With Text, Images, and Audio Together

Multimodal LLMs process text, images, audio, and video in a single model, enabling use cases like document analysis, chart understanding, and audio transcription without separate pipelines.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

7 min read

// tags

#multimodal#vision-llm#gpt-4o#gemini#image-understanding

FIG. ART-20

7 min read

“

Multimodal LLMs: Working With Text, Images, and Audio Together

// reading plan

sections

866

words

min read

// LLM & Language Models

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

Which LLMs write the best code in 2026, what the benchmarks actually measure, how to get better output, and where generated code will still burn you.

9 min read

// LLM & Language Models

Claude 3.5 Sonnet Review: What It Does Better Than GPT-4o (and Where It Falls Short)

A multimodal LLM is a model that processes multiple input types — text, images, audio, video, or documents — in a single model without separate processing pipelines. GPT-4o handles text, images, and audio. Claude 3.5 Sonnet handles text, images, and PDFs. Gemini 1.5 Pro adds native video understanding. Using multimodal models simplifies architecture and enables reasoning that crosses modalities.

Current Capabilities by Model

GPT-4o

GPT-4o (the "o" stands for "omni") is OpenAI's natively multimodal flagship. Inputs it accepts:

Text (obviously)
Images: screenshots, photos, diagrams, charts, handwriting
Audio: real-time voice conversation with notably natural latency

GPT-4o's vision is strong for charts, code screenshots, UI mockups, and handwritten text. Its audio mode is the most natural-sounding and low-latency voice interface available in any commercial API.

Limitation: GPT-4o does not accept video files. For video, you would need to extract frames and submit as images.

Claude 3.5 Sonnet and Opus

Anthropic's models accept text, images, and PDF documents. The PDF support is a meaningful differentiator: you can upload a PDF directly (up to 100 pages) and Claude reads both the text and embedded images natively, without external OCR or text extraction.

Claude 3.5 Sonnet is particularly strong at detailed image analysis and following complex visual instructions. Its vision quality for technical diagrams and dense charts is competitive with GPT-4o.

Gemini 1.5 Pro

Gemini 1.5 Pro is the most comprehensive in terms of input modalities:

Text
Images
Audio files (not real-time voice like GPT-4o, but audio file input)
Video files (native video understanding, not just frame extraction)
Documents

The 1M token context window applies across all modalities. You can submit an hour of video (as frames plus audio) and ask questions about specific moments. This is a capability that GPT-4o and Claude do not match.

Llava (Open Source)

For self-hosted multimodal inference, Llava is the leading open source option. Llava-1.6 (also called LLaVA-NeXT) combines a vision encoder with Llama or Mistral as the language backbone.

ollama run llava

Quality is behind GPT-4o and Claude 3.5 for complex vision tasks, but it is free, open source, and runs locally. For simpler image understanding tasks (object identification, basic document reading, scene description), it is viable.

Common Use Cases

Analyzing Screenshots

Users submit screenshots of dashboards, error messages, or UI bugs and ask for analysis. GPT-4o and Claude 3.5 are both strong here. A screenshot of an error trace can be directly submitted for debugging assistance without copying text manually.

Reading Charts and Graphs

Models can extract values from bar charts, identify trends in line graphs, and describe what data visualizations show. Performance is solid for standard chart types. Complex or cluttered visualizations are harder. For high-stakes data extraction from charts, always verify with the underlying data when available.

Transcribing and Understanding Audio

Gemini 1.5 Pro accepts audio files and can transcribe and reason about the content simultaneously. GPT-4o's real-time audio mode is designed for conversation, not file transcription. For meeting transcription and analysis, specialized tools (Whisper for transcription, then LLM for analysis) often work better than a single multimodal call.

Processing Documents With Images

Technical manuals, research papers, and financial reports often contain a mix of text, tables, and figures. Claude's PDF support and Gemini's document understanding handle these holistically. For a paper with embedded charts, you can ask "What does Figure 3 show and how does it support the authors' conclusion?" without separately extracting the figure.

Limitations

Visual Hallucinations

Multimodal models sometimes confidently describe things that are not in an image or misread text from images. For document processing where accuracy is critical (contracts, medical records), always verify extracted information. The models are far better than they were two years ago, but they still make visual errors.

Degraded Text Accuracy in Images

Reading small text, especially in complex layouts with multiple fonts and sizes, is still imperfect. OCR-specialized tools (Google Vision API, AWS Textract) outperform general-purpose multimodal LLMs on pure text extraction accuracy from images.

Audio vs Video Reliability

Real-time audio conversation (GPT-4o voice mode) works well for clear speech in quiet environments. For accented speech, multiple speakers, or noisy audio, accuracy degrades. Purpose-built transcription models (Whisper, Deepgram) typically outperform multimodal LLMs for pure transcription accuracy.

Cost Considerations

Multimodal inputs cost more than text-only:

GPT-4o image input: images are resized and converted to tokens. A 512x512 image costs roughly 170 tokens in high-detail mode.
Gemini 1.5 Pro video: charged per second of video, at higher rates than text
Audio input: generally charged per second or per audio token

For applications processing high volumes of images or audio, the cost can become significant. Consider whether you need the holistic reasoning of a multimodal model or whether OCR plus text-only LLM is sufficient for your use case.

Keep Reading

Gemini 1.5 Pro vs GPT-4o Comparison — Deep dive into the two leading multimodal models
LLM Embeddings Explained — How to build efficient retrieval over multimodal content
How Large Language Models Work: Complete Guide — The architecture behind multimodal understanding

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.

Multimodal LLMs: Working With Text, Images, and Audio Together

Related Articles

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

Current Capabilities by Model

GPT-4o

Claude 3.5 Sonnet and Opus

Gemini 1.5 Pro

Llava (Open Source)

Common Use Cases

Analyzing Screenshots

Reading Charts and Graphs

Transcribing and Understanding Audio

Processing Documents With Images

Limitations

Visual Hallucinations

Degraded Text Accuracy in Images

Audio vs Video Reliability

Cost Considerations

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Claude 3.5 Sonnet Review: What It Does Better Than GPT-4o (and Where It Falls Short)

LLM Safety and Alignment Explained for Developers

Multimodal LLMs: Working With Text, Images, and Audio Together

Related Articles

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

Current Capabilities by Model

GPT-4o

Claude 3.5 Sonnet and Opus

Gemini 1.5 Pro

Llava (Open Source)

Common Use Cases

Analyzing Screenshots

Reading Charts and Graphs

Transcribing and Understanding Audio

Processing Documents With Images

Limitations

Visual Hallucinations

Degraded Text Accuracy in Images

Audio vs Video Reliability

Cost Considerations

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Claude 3.5 Sonnet Review: What It Does Better Than GPT-4o (and Where It Falls Short)

LLM Safety and Alignment Explained for Developers

The workspace your team
actually needs