LayoutLMv3: Understanding PDFs, Forms, and Documents With Layout Awareness

Microsoft's LayoutLMv3 pretrains on text, bounding boxes, and image patches together, so a single model can be fine-tuned for form understanding, receipt parsing, and document VQA on top of OCR output.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

7 min read

// tags

#layoutlmv3 #document-ai #microsoft #ocr #form-understanding



Documents Are Not Plain Text

A PDF invoice has amounts in the top-right, addresses in the top-left, and line items in a table. A standard language model that strips formatting sees "Invoice #1234 Acme Corp 123 Main St Total $5,000" with no spatial context. LayoutLMv3 can learn that "Total" in the bottom-right with "$5,000" immediately to its right, on the same row of bounding boxes, means something very different from "Total" appearing in a header.
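To make the contrast concrete, here is a minimal sketch of what a flattened string loses and a layout-aware view keeps. The words, pixel coordinates, and the same_row and value_right_of helpers are all invented for illustration; they are not part of LayoutLMv3, which learns such spatial relations from data rather than from hand-written rules.

```python
# Hypothetical OCR output: (word, (x0, y0, x1, y1)) in pixel coordinates.
words = [
    ("Invoice", (40, 30, 120, 50)),
    ("#1234", (125, 30, 180, 50)),
    ("Qty", (300, 200, 340, 220)),
    ("Total", (420, 200, 470, 220)),   # column header, nothing to its right
    ("Total", (420, 700, 470, 720)),   # summary row, bottom-right
    ("$5,000", (480, 700, 550, 720)),
]

# Flattened view: both "Total" tokens look identical once positions are dropped.
flat = " ".join(word for word, _ in words)


def same_row(a, b, tol=5):
    """True if two boxes occupy roughly the same vertical band."""
    return abs(a[1] - b[1]) <= tol and abs(a[3] - b[3]) <= tol


def value_right_of(label, words):
    """For each occurrence of `label`, return the nearest word to its right
    on the same row, or None if there is none."""
    results = []
    for word, box in words:
        if word != label:
            continue
        candidates = [
            (other_box[0] - box[2], other)
            for other, other_box in words
            if same_row(box, other_box) and other_box[0] >= box[2]
        ]
        results.append(min(candidates)[1] if candidates else None)
    return results


print(flat)
print(value_right_of("Total", words))
```

With positions attached, the two occurrences of "Total" separate cleanly: only the one in the summary row has an amount next to it.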

Architecture: Text + Layout + Image Together

LayoutLMv3 uses a unified multimodal transformer that processes three input streams jointly:

Text tokens - from an OCR engine (Tesseract, Azure Form Recognizer, or any other OCR source)
Layout tokens - 2D bounding box coordinates (x0, y0, x1, y1) for each text token, normalized to the 0-1000 range; the model also embeds the box width and height derived from these coordinates
Image patches - the document image divided into 16×16-pixel patches, processed like ViT
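The coordinate normalization and patch arithmetic behind the layout and image streams can be sketched in a few lines. The normalize_box and patch_grid helpers and the page size are hypothetical names and values chosen for this example, and the 224-pixel input size is an assumption based on the default image processor settings for the base checkpoint.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 integer range."""
    x0, y0, x1, y1 = box
    return [
        1000 * x0 // page_width,
        1000 * y0 // page_height,
        1000 * x1 // page_width,
        1000 * y1 // page_height,
    ]


def patch_grid(image_size=224, patch_size=16):
    """Return (patches per side, total patches) for a square input image."""
    side = image_size // patch_size
    return side, side * side


# A word box on a 600x800 pixel page (illustrative values).
box = normalize_box((480, 700, 550, 720), page_width=600, page_height=800)
print(box)            # [800, 875, 916, 900]
print(patch_grid())   # (14, 196)
```

Normalizing to a fixed range makes boxes comparable across pages scanned at different resolutions, which is why the model never sees raw pixel coordinates.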

The model is pretrained with three objectives: Masked Language Modeling (MLM) on text, Masked Image Modeling (MIM) on image patches, and Word-Patch Alignment (WPA), which predicts for each unmasked text token whether its corresponding image patches have been masked.
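WPA is the least familiar of the three objectives, so a small sketch of how its training targets could be derived may help. This is a simplified illustration, not the pretraining code: the wpa_labels function and the token-to-patch mapping are hypothetical, and the rule assumed here is that an unmasked text token counts as aligned only when none of the image patches it overlaps were masked, while masked text tokens are left out of the WPA loss.

```python
def wpa_labels(token_to_patches, masked_text, masked_patches):
    """Return one target per text token:
    1 = aligned, 0 = unaligned, None = excluded from the WPA loss."""
    labels = []
    for index, patches in enumerate(token_to_patches):
        if index in masked_text:
            # Masked text tokens are skipped in this sketch.
            labels.append(None)
        elif any(patch in masked_patches for patch in patches):
            # At least one overlapping image patch was masked.
            labels.append(0)
        else:
            labels.append(1)
    return labels


# Hypothetical mapping: text token i overlaps the listed image patch indices.
token_to_patches = [[0], [1, 2], [3], [4]]
labels = wpa_labels(token_to_patches, masked_text={3}, masked_patches={2, 4})
print(labels)  # [1, 0, 1, None]
```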

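from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
import torch

# apply_ocr=True makes the processor run Tesseract internally
# (this needs the pytesseract package and a Tesseract install).
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
# The base checkpoint has no trained token-classification head, so the labels
# printed below are placeholders until the model is fine-tuned on labeled data.
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base")
model.eval()

image = Image.open("receipt.png").convert("RGB")
# truncation=True keeps long documents within the model's maximum sequence length
encoding = processor(image, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**encoding)

# Drop the batch dimension: one class id, one box, and one token id per position
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
token_boxes = encoding["bbox"].squeeze(0).tolist()
token_ids = encoding["input_ids"].squeeze(0).tolist()

for token_id, box, pred in zip(token_ids, token_boxes, predictions):
    token = processor.tokenizer.decode([token_id])
    print(f"Token: {token:15} Box: {box} Label: {pred}")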
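The loop above prints one row per sub-word token, whereas most applications want one label per OCR word. A common approach, sketched here under assumptions, is to keep the prediction of each word's first sub-token. The first_subtoken_labels helper and the label names are hypothetical; the word_ids list is assumed to come from encoding.word_ids(0), which fast tokenizers expose, with None marking special tokens.

```python
def first_subtoken_labels(word_ids, predictions, id2label):
    """Collapse token-level predictions to one label per word by keeping
    the prediction of each word's first sub-token."""
    labels = {}
    for word_id, pred in zip(word_ids, predictions):
        if word_id is None:
            # Special tokens (start, end, padding) map to no word.
            continue
        if word_id not in labels:
            labels[word_id] = id2label[pred]
    return [labels[i] for i in sorted(labels)]


# Illustrative values standing in for encoding.word_ids(0) and model output.
word_ids = [None, 0, 0, 1, 2, 2, None]
predictions = [0, 1, 2, 0, 2, 2, 0]
id2label = {0: "O", 1: "B-TOTAL", 2: "I-TOTAL"}  # hypothetical label set

print(first_subtoken_labels(word_ids, predictions, id2label))
# ['B-TOTAL', 'O', 'I-TOTAL']
```

Taking the first sub-token is a convention rather than a requirement; majority voting across a word's sub-tokens is a reasonable alternative.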
