Multimodal Prompting: How to Combine Images and Text for Better LLM Outputs

Multimodal prompting lets you send images alongside text instructions. Knowing what to ask and how to ask it determines whether you get useful or vague results.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 18, 2026

8 min read

// tags

#multimodal #image-prompting #gpt-4o #claude #ocr #vision-models

Referencing Specific Regions of an Image

When an image contains multiple elements (several charts on a dashboard, several sections of a document, several UI components), tell the model which region to focus on. Models generally follow spatial references well, provided the reference picks out exactly one region.

Effective spatial references:

"In the top-left chart..."
"In the table in the bottom half of the image..."
"In the highlighted section..."
"In the legend on the right side..."
"In the first row of the data table..."

If the image does not have obvious spatial markers, describe a visual characteristic: "In the chart with the blue line..." or "In the section with the heading 'Revenue.'"
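When one image needs several region-specific questions, it can help to generate the prompt so every question carries its own reference. The sketch below is illustrative only: build_region_prompt and the wording of its opening line are assumptions, not part of any library.

```python
def build_region_prompt(questions: dict[str, str]) -> str:
    """Build one prompt in which each question names the region it applies to.

    `questions` maps a region description (hypothetical examples:
    "the top-left chart", "the chart with the blue line") to a question.
    """
    if not questions:
        raise ValueError("at least one region/question pair is required")
    lines = ["Answer each question using only the region it names."]
    for number, (region, question) in enumerate(questions.items(), start=1):
        lines.append(f"{number}. In {region}: {question}")
    return "\n".join(lines)


prompt = build_region_prompt({
    "the top-left chart": "What is the peak value?",
    "the legend on the right side": "Which series are listed?",
})
```

Keeping the region and the question on the same line makes it harder for the model to answer from the wrong part of the image.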

Use Cases and How to Prompt for Each

Receipt and invoice data extraction:

Extract the following fields from this receipt image. Return a JSON object with these exact keys: vendor_name, date, total_amount, tax_amount, line_items (array of {description, quantity, unit_price}).

If any field is not visible or legible, set its value to null.

The explicit JSON schema tells the model exactly what to extract and in what format. The null instruction handles partial or damaged receipts cleanly.
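On the receiving side, the same key list can be used to check the reply before it reaches downstream code. This is a minimal sketch that assumes the model returns bare JSON; parse_receipt is a hypothetical helper, and the keys are the ones named in the prompt above.

```python
import json

REQUIRED_KEYS = ("vendor_name", "date", "total_amount", "tax_amount", "line_items")
LINE_ITEM_KEYS = ("description", "quantity", "unit_price")


def parse_receipt(raw: str) -> dict:
    """Parse the model's reply and check it has the keys the prompt asked for."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    # null (None) is a valid value for any field, including line_items.
    items = data["line_items"]
    if items is not None:
        if not isinstance(items, list):
            raise ValueError("line_items must be an array or null")
        for index, item in enumerate(items):
            if not isinstance(item, dict):
                raise ValueError(f"line_items[{index}] must be an object")
            absent = [key for key in LINE_ITEM_KEYS if key not in item]
            if absent:
                raise ValueError(f"line_items[{index}] missing keys: {absent}")
    return data
```

A key that is present with a null value passes the check, while a key that is absent altogether is treated as a malformed reply.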

UI screenshot analysis:

This is a screenshot of a web application. Identify:
1. The main navigation elements and their labels
2. The primary content area and what it contains
3. Any form inputs and their labels
4. Any error messages or warning states visible

Format as a structured list for each category.

Chart interpretation:

This chart shows monthly revenue data. Answer these specific questions:
1. What is the approximate value for [month]?
2. What is the percentage change between [month A] and [month B]?
3. Is the overall trend increasing, decreasing, or flat?
4. Are there any notable anomalies or outliers?

If exact values are not labeled, provide your best estimate based on the scale visible.

The final instruction acknowledges that charts often do not label every data point, which prevents the model from refusing to answer or generating vague non-answers.
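Because values read off an unlabeled chart are estimates, one way to sanity-check the answer to question 2 is to ask for the two underlying values and recompute the change yourself. The sketch below assumes that workflow; percent_change is a hypothetical helper, and the figures in the example are made up.

```python
def percent_change(value_a: float, value_b: float) -> float:
    """Percentage change from value_a to value_b, relative to value_a."""
    if value_a == 0:
        raise ValueError("percentage change is undefined when the starting value is 0")
    return (value_b - value_a) / abs(value_a) * 100.0


# Made-up estimates for two months, as a model might read them off the chart.
estimated_a = 40_000.0
estimated_b = 50_000.0
change = percent_change(estimated_a, estimated_b)
```

If the recomputed figure disagrees with the percentage the model reported, at least one of its readings is unreliable.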

Document with mixed visual and text content:

This is a page from a technical specification document. The document contains both text and diagrams. Summarize the technical specifications described in the text, and separately describe what the diagram illustrates and how it relates to the specifications.

Image Quality and Its Effect on Accuracy

Model accuracy degrades as image quality drops. Low resolution, blurry text, heavy compression artifacts, and low contrast all reduce the model's ability to extract information accurately.

Practical guidelines:

Minimum resolution for text extraction: 150 DPI or approximately 1200 pixels on the longer dimension. Below this, small text becomes unreliable.
Charts and diagrams: The axis labels must be legible. If they are too small to read at standard zoom, the model will not read them reliably either.
Photographs of physical documents: Use even lighting with no shadows. Shadows on text reduce extraction accuracy significantly.
Screenshots: Use the actual full-resolution screenshot, not a compressed or resized version embedded in a presentation.
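The resolution guideline is easy to enforce before an image is sent. The sketch below reads the pixel dimensions straight from a PNG header using only the standard library. It handles PNG only. The function names are hypothetical, and the 1200-pixel threshold is the approximate figure from the guideline above, not a hard limit published by any model provider.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
MIN_LONG_EDGE = 1200  # pixels, the approximate threshold from the guideline above


def png_size(data: bytes) -> tuple[int, int]:
    """Return (width, height) read from the IHDR chunk at the start of a PNG file."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian unsigned 32-bit integers at bytes 16-23.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


def large_enough_for_text(width: int, height: int,
                          min_long_edge: int = MIN_LONG_EDGE) -> bool:
    """True if the longer dimension meets the minimum for text extraction."""
    return max(width, height) >= min_long_edge
```

For a file on disk, the first 24 bytes are enough: png_size(open(path, "rb").read(24)).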

When poor image quality is unavoidable (old documents, field photography), add an instruction: "The image may have limited legibility. Extract what you can read clearly and mark unclear sections as [illegible]."

Comparing Multiple Images

Most multimodal APIs support sending multiple images in one request. This enables comparison tasks.

I'm sending two UI screenshots: the current design (Image 1) and the proposed redesign (Image 2).

Compare them on these dimensions:
1. Navigation structure: What changed?
2. Information hierarchy: What is more or less prominent?
3. Visual consistency: Does the redesign use consistent spacing, typography, and color?
4. Potential usability issues: Are there elements in the redesign that might confuse users familiar with the current design?

For UI diff workflows, sending both images is usually faster and less error-prone than describing the differences to the model in text alone.
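A request like this is typically assembled as a list of content blocks, with a short text label before each image so that "Image 1" and "Image 2" in the instructions are unambiguous. The sketch below uses a provider-neutral block shape that is an assumption, not any vendor's actual schema. Field names and image encoding vary by API, so map the blocks to your provider's format before sending.

```python
import base64


def build_comparison_content(images: list[bytes], instructions: str,
                             media_type: str = "image/png") -> list[dict]:
    """Interleave 'Image N:' labels with base64-encoded images, then add the instructions.

    The dict layout here is hypothetical and provider-neutral.
    """
    if len(images) < 2:
        raise ValueError("a comparison needs at least two images")
    content: list[dict] = []
    for index, raw in enumerate(images, start=1):
        content.append({"type": "text", "text": f"Image {index}:"})
        content.append({
            "type": "image",
            "media_type": media_type,
            "data": base64.b64encode(raw).decode("ascii"),
        })
    content.append({"type": "text", "text": instructions})
    return content
```

Putting the instructions last keeps the comparison questions next to the point where the model starts answering.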

Combining Multimodal Input with Structured Output

Multimodal prompts benefit from explicit output format instructions, especially for data extraction use cases. The model is doing two hard things at once: parsing a visual input and generating structured output. Explicit formatting instructions reduce the chance of unstructured prose in the response.

Always specify whether you want JSON, markdown, a list, or prose. If you want JSON, include the schema. If the model might not be able to extract certain fields, tell it what to return in that case (null, "not found," or an explanation).
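Even with explicit instructions, a reply sometimes wraps the JSON in a sentence or a code fence, so a tolerant parser is a useful safety net. This is a minimal sketch using the standard library's json.JSONDecoder.raw_decode. The helper name extract_json is hypothetical.

```python
import json


def extract_json(reply: str):
    """Return the first JSON object or array found anywhere in a model reply."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(reply):
        if char in "{[":
            try:
                # raw_decode parses one JSON value starting at `start`
                # and ignores whatever text follows it.
                value, _end = decoder.raw_decode(reply, start)
            except json.JSONDecodeError:
                continue  # not valid JSON from this position, keep scanning
            return value
    raise ValueError("no JSON object or array found in reply")
```

A parser like this is a fallback, not a substitute for the format instructions: if it raises, the reply did not follow the requested format and the request can be retried or flagged.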

OCR vs. Multimodal Models

Traditional OCR (optical character recognition) tools like Tesseract or cloud OCR APIs extract text from images without understanding context. Multimodal models do understand context: they can answer questions about what they see, not just transcribe it.

Use traditional OCR when you need raw text extraction from clean, formatted documents (typed text, standard forms) and you do not need any analysis or interpretation. It is faster and cheaper than a full model call.

Use multimodal models when you need to interpret what the visual content means: answering questions about charts, understanding the relationship between visual elements, extracting structured data from unstructured documents, or analyzing UI designs.
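In a pipeline, the two rules above reduce to a small routing decision. The sketch below is one possible encoding with a hypothetical function name. The case the rules do not cover (raw text from a messy document, with no interpretation needed) is routed to the multimodal model here, which is an assumption rather than a recommendation from this article.

```python
def choose_extractor(needs_interpretation: bool, clean_typed_text: bool) -> str:
    """Pick 'ocr' or 'multimodal' for one document, following the two rules above."""
    if needs_interpretation:
        # Questions about meaning, relationships, or structure need a model.
        return "multimodal"
    if clean_typed_text:
        # Raw transcription of clean, formatted text: OCR is faster and cheaper.
        return "ocr"
    # Assumption: messy input with no interpretation needed defaults to the model.
    return "multimodal"
```

Routing per document rather than per project lets clean, typed pages take the cheaper path while charts and irregular layouts go to the model.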

Summary

Effective multimodal prompting requires three things: choosing a model that supports your input type, being specific about what to extract or analyze rather than asking for generic descriptions, and using spatial references when the image contains multiple regions. Image quality directly affects output accuracy. For production use cases, always specify the output format explicitly and test with a sample of representative images including low-quality edge cases.

Keep Reading

Structured Output Prompting Guide - getting clean structured data from image analysis prompts
Prompt Engineering Complete Guide 2026 - the full context for all prompting techniques including multimodal
LLM Output Parsing Guide - reliably handling the output of multimodal extraction prompts
