Multimodal prompting means sending both images and text to a language model in the same request. The model processes both modalities together and produces a text response based on what it sees and what it is instructed to do. The technique unlocks use cases that text-only prompting cannot handle — analyzing charts, extracting data from receipts, explaining UI screenshots, and reading documents with visual elements.
Model Capabilities: What Each Supports
Before building a multimodal workflow, verify that your chosen model supports the input types you need.
GPT-4o (OpenAI): Accepts text, images, and audio input. Handles multiple images in a single request. Supports high-resolution image analysis with the detail parameter set to "high."
Claude 3.5 Sonnet and Claude 3 Opus (Anthropic): Accept text, images, and PDF documents. Can process multiple images. Particularly strong at reading dense documents with visual and text elements interleaved.
Gemini 1.5 Pro and Gemini 2.0 (Google): Accept text, images, audio, and video. The only major model family with native video input support. Can process entire videos and respond to questions about specific moments.
LLaVA and open-source alternatives: Various open-weight models support image + text. Capability varies significantly by model size and training data.
Why Specificity Matters More in Multimodal Prompts
When you send an image to a model, the model can describe it in thousands of different ways depending on what it decides to focus on. Without specific instructions, the model defaults to a generic description of the most visually prominent elements.
Vague prompt: "Describe this image." Model response: "The image shows a bar chart with several colored bars of varying heights. There is a title at the top and axis labels."
Specific prompt: "Identify the category with the highest value in this bar chart, state the exact value if visible, and describe the overall trend across all categories." Model response: "The 'Enterprise' category has the highest value at approximately 2.4M. The overall trend shows growth from Q1 to Q3 followed by a slight decline in Q4 across all categories except 'SMB,' which grew consistently throughout the year."
The second prompt gives the model a concrete task rather than an open-ended instruction. This is the single most important principle of effective multimodal prompting.
Referencing Specific Regions of an Image
When an image contains multiple elements (multiple charts on a dashboard, multiple sections of a document, multiple UI components), you need to tell the model which region to focus on. Models can follow spatial references accurately.
Effective spatial references:
- "In the top-left chart..."
- "In the table in the bottom half of the image..."
- "In the highlighted section..."
- "In the legend on the right side..."
- "In the first row of the data table..."
If the image does not have obvious spatial markers, describe a visual characteristic: "In the chart with the blue line..." or "In the section with the heading 'Revenue.'
Use Cases and How to Prompt for Each
Receipt and invoice data extraction:
Extract the following fields from this receipt image. Return a JSON object with these exact keys: vendor_name, date, total_amount, tax_amount, line_items (array of {description, quantity, unit_price}).
If any field is not visible or legible, set its value to null.
The explicit JSON schema tells the model exactly what to extract and in what format. The null instruction handles partial or damaged receipts cleanly.
UI screenshot analysis:
This is a screenshot of a web application. Identify:
1. The main navigation elements and their labels
2. The primary content area and what it contains
3. Any form inputs and their labels
4. Any error messages or warning states visible
Format as a structured list for each category.
Chart interpretation:
This chart shows monthly revenue data. Answer these specific questions:
1. What is the approximate value for [month]?
2. What is the percentage change between [month A] and [month B]?
3. Is the overall trend increasing, decreasing, or flat?
4. Are there any notable anomalies or outliers?
If exact values are not labeled, provide your best estimate based on the scale visible.
The fourth instruction acknowledges that charts often do not label every data point, which prevents the model from refusing to answer or generating vague non-answers.
Document with mixed visual and text content:
This is a page from a technical specification document. The document contains both text and diagrams. Summarize the technical specifications described in the text, and separately describe what the diagram illustrates and how it relates to the specifications.
Image Quality and Its Effect on Accuracy
Model accuracy degrades with image quality. Low-resolution images, blurry text, high compression artifacts, and low contrast all reduce the model's ability to extract information accurately.
Practical guidelines:
- Minimum resolution for text extraction: 150 DPI or approximately 1200 pixels on the longer dimension. Below this, small text becomes unreliable.
- Charts and diagrams: The axis labels must be legible. If they are too small to read at standard zoom, the model will not read them reliably either.
- Photographs of physical documents: Even lighting with no shadows. Shadows on text reduce extraction accuracy significantly.
- Screenshots: Use the actual full-resolution screenshot, not a compressed or resized version embedded in a presentation.
When image quality is unavoidable (old documents, field photography), add an instruction: "The image may have limited legibility. Extract what you can read clearly and mark unclear sections as [illegible]."
Comparing Multiple Images
Most multimodal APIs support sending multiple images in one request. This enables comparison tasks.
I'm sending two UI screenshots: the current design (Image 1) and the proposed redesign (Image 2).
Compare them on these dimensions:
1. Navigation structure: What changed?
2. Information hierarchy: What is more or less prominent?
3. Visual consistency: Does the redesign use consistent spacing, typography, and color?
4. Potential usability issues: Are there elements in the redesign that might confuse users familiar with the current design?
For code or UI diff workflows, this multi-image comparison pattern is faster and more accurate than describing the differences in text alone.
Combining Multimodal Input with Structured Output
Multimodal prompts benefit from explicit output format instructions, especially for data extraction use cases. The model is doing two hard things simultaneously — parsing a visual input and generating structured output. Give it explicit formatting to reduce the chances of unstructured prose in the response.
Always specify: whether you want JSON, markdown, a list, or prose. If JSON, include the schema. If the model might not be able to extract certain fields, tell it what to return in that case (null, "not found," or an explanation).
OCR vs. Multimodal Models
Traditional OCR (optical character recognition) tools like Tesseract or cloud OCR APIs extract text from images without understanding context. Multimodal models understand context — they can answer questions about what they see, not just transcribe it.
Use traditional OCR when you need raw text extraction from clean, formatted documents (typed text, standard forms) and you do not need any analysis or interpretation. It is faster and cheaper than a full model call.
Use multimodal models when you need to interpret what the visual content means — answering questions about charts, understanding the relationship between visual elements, extracting structured data from unstructured documents, or analyzing UI designs.
Summary
Effective multimodal prompting requires three things: choosing a model that supports your input type, being specific about what to extract or analyze rather than asking for generic descriptions, and using spatial references when the image contains multiple regions. Image quality directly affects output accuracy. For production use cases, always specify the output format explicitly and test with a sample of representative images including low-quality edge cases.
Keep Reading
- Structured Output Prompting Guide — getting clean structured data from image analysis prompts
- Prompt Engineering Complete Guide 2026 — the full context for all prompting techniques including multimodal
- LLM Output Parsing Guide — reliably handling the output of multimodal extraction prompts
Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.