Longformer: Process 4096-Token Documents With Sliding Window Attention

Longformer extends BERT-style encoders (it is initialized from RoBERTa) from 512 to 4096 tokens by combining local sliding window attention with global attention on a few chosen tokens, making it practical for document classification, question answering, and NER on long-form text.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 7, 2026

7 min read

// tags

#longformer #long-documents #attention #bert #allenai


Document Classification

import torch
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
# The classification head is newly initialized here; fine-tune before trusting predictions.
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=4,
)
model.eval()  # disable dropout for inference

long_document = "..." * 1000  # placeholder standing in for a 3000+ word document

inputs = tokenizer(
    long_document,
    return_tensors="pt",
    max_length=4096,
    truncation=True,
    padding="max_length",
)

# Global attention on the first token (<s>, the [CLS] equivalent)
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()

Of the Hugging Face checkpoints, allenai/longformer-large-4096 is the highest-quality variant; allenai/longformer-base-4096 is about 3x faster.
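Whichever checkpoint you pick, most of the savings at 4096 tokens come from the attention pattern itself. The sketch below is a back-of-the-envelope count of attention scores per layer and head, not a benchmark. It assumes a total window of 512 tokens and a single global token (as in the classification example); both function names are hypothetical, and edge effects at the sequence boundaries are ignored.

```python
# Rough count of attention score entries per layer and head.
# Assumptions (not from the article): total window of 512 tokens,
# one global token, boundary effects ignored.

def full_attention_entries(seq_len: int) -> int:
    """Full self-attention: every token scores every token (n * n)."""
    return seq_len * seq_len


def sliding_window_entries(seq_len: int, window: int, num_global: int = 1) -> int:
    """Approximate score count for a local window plus global tokens.

    Each token scores about `window` neighbours; each global token
    attends to, and is attended by, all `seq_len` tokens.
    """
    local = seq_len * window
    global_part = 2 * num_global * seq_len
    return local + global_part


for n in (512, 4096):
    full = full_attention_entries(n)
    sparse = sliding_window_entries(n, window=512)
    print(f"n={n}: full={full:,} sparse~{sparse:,} ratio={full / sparse:.1f}x")
```

Under these assumptions the two patterns cost about the same at 512 tokens, while at 4096 tokens the windowed pattern computes roughly 8x fewer scores, because its cost grows linearly with sequence length rather than quadratically.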

Document Q&A With LongformerForQuestionAnswering

import torch
from transformers import LongformerForQuestionAnswering, LongformerTokenizerFast

checkpoint = "allenai/longformer-large-4096-finetuned-triviaqa"
tokenizer = LongformerTokenizerFast.from_pretrained(checkpoint)
model = LongformerForQuestionAnswering.from_pretrained(checkpoint)
model.eval()  # disable dropout for inference

question = "What is the main argument in section 3?"
document = "...full document text..."

encoding = tokenizer(question, document, return_tensors="pt", max_length=4096, truncation=True)

# Global attention on question tokens (sequence id 0); special tokens have id None
sequence_ids = encoding.sequence_ids(0)
global_attention = [1 if t == 0 else 0 for t in sequence_ids]
encoding["global_attention_mask"] = torch.tensor([global_attention])

with torch.no_grad():  # no gradients needed for inference
    outputs = model(**encoding)

# Most likely start and end token positions of the answer span
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
answer = tokenizer.decode(encoding["input_ids"][0][start:end + 1])
print(answer)
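Taking the argmax of the start and end logits independently can return an end position that falls before the start, which decodes to an empty answer. A common remedy is to score start and end jointly and keep only valid spans. The following is a minimal sketch of that idea in plain Python; `best_span` and its `max_answer_len` parameter are hypothetical names for illustration, not part of the transformers API, and the logits are made-up numbers.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,
) -> Tuple[int, int]:
    """Return the (start, end) pair with the highest summed logit,
    restricted to start <= end and at most `max_answer_len` tokens."""
    best_score = float("-inf")
    best = (0, 0)
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy logits where the independent argmax would give start=2, end=0
start_logits = [0.1, 0.2, 3.0, 0.3]
end_logits = [2.5, 0.1, 0.4, 1.0]
print(best_span(start_logits, end_logits))
```

With the model outputs from the example above, the inputs would be `outputs.start_logits[0].tolist()` and `outputs.end_logits[0].tolist()`. This sketch does not exclude question tokens from the candidate spans; masking them out first is a reasonable extension.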

Comparison to Big Bird

Big Bird (Google) solves the same long-sequence problem with a different attention pattern: random attention combined with sliding window and global attention. Both models report similar accuracy on long-document benchmarks. Longformer is more widely deployed and has better HuggingFace ecosystem support; Big Bird's random attention may provide marginal advantages on extremely long documents (more than 4096 tokens, using extended variants).
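The difference between the two patterns is easiest to see on a toy sequence. The sketch below lists which key positions a single query token may attend to under each scheme. It is a token-level simplification under stated assumptions: the real Big Bird implementation works on blocks of tokens rather than individual positions, and every function and parameter name here is hypothetical.

```python
import random
from typing import Set


def longformer_style_keys(
    query: int, seq_len: int, one_sided_window: int, global_positions: Set[int]
) -> Set[int]:
    """Key positions visible to `query`: local window plus global tokens."""
    if query in global_positions:
        # A global token attends to the whole sequence
        return set(range(seq_len))
    lo = max(0, query - one_sided_window)
    hi = min(seq_len - 1, query + one_sided_window)
    return set(range(lo, hi + 1)) | global_positions


def big_bird_style_keys(
    query: int,
    seq_len: int,
    one_sided_window: int,
    global_positions: Set[int],
    num_random: int,
    rng: random.Random,
) -> Set[int]:
    """Same as above, plus `num_random` extra keys sampled at random."""
    keys = longformer_style_keys(query, seq_len, one_sided_window, global_positions)
    candidates = sorted(set(range(seq_len)) - keys)
    extra = rng.sample(candidates, min(num_random, len(candidates)))
    return keys | set(extra)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
print(sorted(longformer_style_keys(8, 16, 2, {0})))
print(sorted(big_bird_style_keys(8, 16, 2, {0}, 2, rng)))
```

In this toy setup the Longformer-style token at position 8 sees its five-token window plus the global token at position 0, while the Big Bird-style token sees the same six positions plus two randomly chosen ones, which is how random attention shortens the path between distant parts of a document.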
