Documents Are Not Plain Text
A PDF invoice has amounts in the top-right, addresses in the top-left, and line items in a table. A standard language model fed only the extracted text sees "Invoice #1234 Acme Corp 123 Main St Total $5,000" with no spatial context. LayoutLMv3 also sees where each word sits on the page, so it can learn that "Total" in the bottom-right with "$5,000" immediately to its right, on the same row, means something very different from "Total" appearing in a header.
Architecture: Text + Layout + Image Together
LayoutLMv3 uses a unified multimodal transformer that processes three input streams jointly:
- Text tokens: produced by an OCR engine (Tesseract, Azure Form Recognizer, or any other OCR output)
- Layout: a 2D bounding box (x1, y1, x2, y2) for each text token, normalized to [0, 1000]; width and height are derived from these coordinates
- Image patches: the document image divided into 16×16-pixel patches and linearly embedded, as in ViT
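To make the [0, 1000] normalization concrete, here is a minimal sketch that scales pixel-space OCR boxes into that range. The helper name `normalize_box`, the page size, and the sample boxes are illustrative assumptions, not part of any library.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x1, y1, x2, y2) box to integers in the 0-1000 range."""
    x1, y1, x2, y2 = box
    return [
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
        int(1000 * x2 / page_width),
        int(1000 * y2 / page_height),
    ]


# Hypothetical OCR output for a 1700 x 2200 pixel scan
words = ["Total", "$5,000"]
pixel_boxes = [(1190, 1980, 1300, 2024), (1360, 1980, 1530, 2024)]

boxes = [normalize_box(b, 1700, 2200) for b in pixel_boxes]
print(boxes)  # [[700, 900, 764, 920], [800, 900, 900, 920]]
```

Because every page is mapped to the same coordinate range, a "Total" near the bottom-right gets similar layout values regardless of the scan resolution.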
The model is pretrained with three objectives: Masked Language Modeling (MLM) on text, Masked Image Modeling (MIM) on image patches, and Word-Patch Alignment (WPA), which predicts whether the image patches corresponding to a text token have been masked.
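The WPA objective can be illustrated with a toy sketch: map each word box onto the patch grid, then label the word "unaligned" if any of its patches was masked. This is a simplified illustration, not the pretraining code; it assumes a 224 x 224 input (a 14 x 14 grid of 16-pixel patches), and the function names and sample values are hypothetical.

```python
GRID = 14  # 224 / 16 patches per side, assuming a 224 x 224 input image


def patches_for_box(box, grid=GRID, scale=1000):
    """Return the patch indices overlapped by a box in 0-1000 coordinates."""
    x1, y1, x2, y2 = box
    col_start = min(x1 * grid // scale, grid - 1)
    col_end = min(x2 * grid // scale, grid - 1)
    row_start = min(y1 * grid // scale, grid - 1)
    row_end = min(y2 * grid // scale, grid - 1)
    return {
        row * grid + col
        for row in range(row_start, row_end + 1)
        for col in range(col_start, col_end + 1)
    }


def wpa_labels(boxes, masked_patches):
    """Label a word 'unaligned' if any patch under it is masked, else 'aligned'."""
    return [
        "unaligned" if patches_for_box(box) & masked_patches else "aligned"
        for box in boxes
    ]


# Hypothetical word boxes and one masked patch
boxes = [[700, 900, 764, 920], [800, 900, 900, 920]]
masked_patches = {179}

print(wpa_labels(boxes, masked_patches))  # ['aligned', 'unaligned']
```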
from transformers import LayoutLMv3Processor, LayoutLMv3ForTokenClassification
from PIL import Image
import torch

# apply_ocr=True runs Tesseract internally (needs pytesseract and the Tesseract binary)
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
# The base checkpoint has no trained classification head, so the labels
# printed below are arbitrary until the model is fine-tuned.
model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base")
model.eval()

image = Image.open("receipt.png").convert("RGB")
encoding = processor(image, return_tensors="pt", truncation=True)

# Inference only: skip gradient tracking
with torch.no_grad():
    outputs = model(**encoding)

# Drop the batch dimension and pick the highest-scoring label per token
predictions = outputs.logits.argmax(-1).squeeze(0).tolist()
token_boxes = encoding["bbox"].squeeze(0).tolist()
token_ids = encoding["input_ids"].squeeze(0).tolist()

for token_id, box, pred in zip(token_ids, token_boxes, predictions):
    token = processor.tokenizer.decode([token_id])
    print(f"Token: {token:15} Box: {box} Label: {pred}")
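The loop above prints one label per sub-word token, while downstream code usually wants one label per OCR word. A minimal sketch of that aggregation follows, keeping the prediction of each word's first sub-token. The helper `word_level_predictions` and the sample values are hypothetical; in practice the `word_ids` list would likely come from `encoding.word_ids(0)`, assuming the processor wraps a fast tokenizer.

```python
def word_level_predictions(word_ids, token_predictions, id2label):
    """Keep the prediction of the first sub-token of each word."""
    first_labels = {}
    for word_id, pred in zip(word_ids, token_predictions):
        if word_id is None:  # special or padding token, not part of any word
            continue
        if word_id not in first_labels:
            first_labels[word_id] = id2label[pred]
    return [first_labels[i] for i in sorted(first_labels)]


# Hypothetical values: None marks special tokens, repeated ids mark sub-tokens
word_ids = [None, 0, 0, 1, 2, 2, 2, None]
token_predictions = [0, 3, 0, 0, 2, 2, 0, 0]
id2label = {0: "O", 1: "B-DATE", 2: "B-TOTAL", 3: "B-VENDOR", 4: "B-ADDRESS"}

print(word_level_predictions(word_ids, token_predictions, id2label))
# ['B-VENDOR', 'O', 'B-TOTAL']
```

Taking the first sub-token is one common convention; majority voting across sub-tokens is another reasonable choice.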
Fine-Tuning on Custom Documents
For domain-specific forms, fine-tuning LayoutLMv3 on labeled examples often outperforms prompting a general-purpose vision-language model (VLM), particularly when the fields and layouts are fixed:
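from transformers import LayoutLMv3ForTokenClassification, TrainingArguments, Trainer

# Label scheme for receipt fields; "O" marks tokens outside any entity
label2id = {"O": 0, "B-DATE": 1, "B-TOTAL": 2, "B-VENDOR": 3, "B-ADDRESS": 4}
id2label = {v: k for k, v in label2id.items()}

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="layoutlmv3-receipts",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    learning_rate=5e-5,
    fp16=True,  # mixed precision; requires a CUDA GPU
    evaluation_strategy="epoch",  # renamed eval_strategy in recent transformers releases
)

# train_dataset and eval_dataset are assumed to be already encoded with the
# processor (input_ids, attention_mask, bbox, pixel_values, labels).
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=processor,  # recent releases expect processing_class=processor instead
)
trainer.train()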
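To track progress during fine-tuning, a metric over the predicted labels is useful. Below is a simplified, dependency-free sketch of token-level precision, recall, and F1. It assumes label id 0 is "O" and that positions labeled -100 (the usual ignore index for special and continuation tokens) should be skipped. Published benchmark numbers are typically entity-level, so this token-level version is only an approximation, and the function name `token_f1` is hypothetical.

```python
def token_f1(predictions, labels, ignore_index=-100, outside_id=0):
    """Token-level precision/recall/F1 over non-"O" labels."""
    tp = fp = fn = 0
    for pred_seq, label_seq in zip(predictions, labels):
        for pred, gold in zip(pred_seq, label_seq):
            if gold == ignore_index:
                continue  # special or continuation token, not scored
            if pred == gold and gold != outside_id:
                tp += 1
            else:
                if pred != outside_id:
                    fp += 1  # predicted an entity label that is wrong here
                if gold != outside_id:
                    fn += 1  # missed or mislabeled a gold entity token
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical batch with one sequence: two correct entity tokens, one missed
predictions = [[0, 3, 0, 2, 0]]
labels = [[-100, 3, 0, 2, 1]]
metrics = token_f1(predictions, labels)
print(metrics)
```

A function like this could be wrapped in a `compute_metrics` callback for the `Trainer`, after taking the argmax of the logits.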
Benchmark Datasets for Document AI
- FUNSD (Form Understanding in Noisy Scanned Documents): 199 forms annotated with semantic entity labels and entity links; a standard F1 benchmark for form NER
- CORD (Consolidated Receipt Dataset): receipt images annotated with 30 semantic label categories, used for receipt parsing; the paper describes more than 11,000 receipts, of which 1,000 are in the public release
- DocVQA: 50,000 question-answer pairs over scanned documents; requires layout-aware reading comprehension
With standard fine-tuning recipes, LayoutLMv3 is reported to reach F1 above 90% on both FUNSD and CORD.
Integration With Azure Form Recognizer
Azure Form Recognizer (since renamed Azure AI Document Intelligence) is Microsoft's managed service for structured extraction, and it reportedly builds on LayoutLM-style models. If you need managed infrastructure with SLA guarantees and prebuilt models for invoices, receipts, ID cards, and W-2s, it is the production-ready path that avoids running model serving yourself.