The BERT 512-Token Problem
Standard transformer attention is quadratic in sequence length — doubling the sequence length quadruples computation. BERT's 512-token limit means long documents must be chunked, losing inter-chunk context. A question that spans two chunks, or a document summary that requires reading the conclusion before understanding the introduction, breaks with naive chunking.
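To make the chunking failure concrete, here is a minimal sketch in plain Python. It assumes non-overlapping chunks of exactly 512 tokens and ignores the special tokens a real BERT input reserves; chunk_tokens and span_is_split are hypothetical helper names used only for illustration, not library functions.

```python
def chunk_tokens(tokens, max_len=512):
    """Split a token list into consecutive, non-overlapping chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


def span_is_split(start, end, max_len=512):
    """True if the inclusive token span [start, end] crosses a chunk boundary."""
    return start // max_len != end // max_len


# Stand-in for a tokenized 1200-token document
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)

print([len(c) for c in chunks])   # [512, 512, 176]
print(span_is_split(505, 520))    # True: the evidence straddles chunks 0 and 1
print(span_is_split(100, 130))    # False: the evidence sits inside chunk 0
```

Any evidence span for which span_is_split returns True is never seen whole by a model that reads one chunk at a time.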
Longformer from AllenAI solves this with two attention mechanisms applied together:
- Sliding window attention: Each token attends to window_size neighbors on each side. Local context is preserved efficiently at O(n × w) cost.
- Global attention: Selected tokens (like [CLS] or question tokens) attend to all other tokens and are attended to by all other tokens. Global tokens gather document-level context.
This combination keeps cost linear in sequence length, O(n × w) for a fixed window plus a small extra term for the handful of global tokens, while preserving the ability to reason across the full document.
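The two patterns can be visualized with a toy mask. The sketch below is illustrative only: it materializes the full n × n matrix, which a real implementation avoids, and longformer_style_mask, count_pairs, and the one_sided_window parameter are hypothetical names chosen for this example.

```python
def longformer_style_mask(n, one_sided_window, global_positions):
    """Boolean n x n mask: mask[i][j] is True when query i may attend to key j."""
    global_set = set(global_positions)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= one_sided_window          # sliding window
            is_global = i in global_set or j in global_set  # global rows and columns
            mask[i][j] = local or is_global
    return mask


def count_pairs(mask):
    """Number of (query, key) pairs that are allowed to interact."""
    return sum(sum(row) for row in mask)


mask = longformer_style_mask(n=8, one_sided_window=1, global_positions=[0])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
print(count_pairs(mask), "of", 8 * 8, "pairs")  # 34 of 64 pairs
```

The printed pattern shows a diagonal band (local attention) plus a fully filled first row and first column (the global token). As n grows with the window held fixed, the band grows linearly while the full matrix grows quadratically.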
When to Use Longformer vs Chunking Strategies
Use Longformer when:
- Questions or labels depend on evidence scattered across the document
- You need document-level representations (not sentence-level)
- Documents are 600-4000 tokens consistently
Use chunking when:
- Documents are 4000+ tokens (Longformer's 4096 limit still applies)
- Questions can always be answered from a local passage
- Throughput matters more than accuracy (chunking + retrieval is faster)
The sweet spot is 800-2500 token documents — long enough that BERT fails, short enough that Longformer handles cleanly.
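The guidance above can be condensed into a small routing helper. This is a sketch under stated assumptions, not a rule from any library: choose_strategy and needs_document_context are hypothetical names, and the only thresholds used are the 512-token and 4096-token limits mentioned in the text.

```python
BERT_LIMIT = 512          # BERT's maximum sequence length
LONGFORMER_LIMIT = 4096   # limit of the longformer-*-4096 checkpoints


def choose_strategy(num_tokens, needs_document_context=True):
    """Pick a modeling strategy from document length and task needs."""
    if num_tokens <= BERT_LIMIT:
        return "bert"            # fits in a single standard window
    if num_tokens <= LONGFORMER_LIMIT and needs_document_context:
        return "longformer"      # evidence may be scattered across the document
    return "chunking"            # too long, or a local passage is enough


print(choose_strategy(300))           # bert
print(choose_strategy(1500))          # longformer
print(choose_strategy(1500, False))   # chunking
print(choose_strategy(9000))          # chunking
```

In practice the cutoffs would be tuned against measured accuracy and throughput rather than hard-coded.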
Document Classification
from transformers import LongformerForSequenceClassification, LongformerTokenizerFast
import torch

tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-base-4096")
# The classification head is newly initialized here; fine-tune before trusting predictions
model = LongformerForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096",
    num_labels=4,
)
model.eval()

long_document = "..." * 1000  # placeholder; substitute a 3000+ word document
inputs = tokenizer(
    long_document,
    return_tensors="pt",
    max_length=4096,
    truncation=True,
    padding="max_length",
)

# Global attention on the first token (<s>, the tokenizer's equivalent of [CLS])
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs, global_attention_mask=global_attention_mask)
logits = outputs.logits
predicted_class = logits.argmax(dim=-1).item()
Of the checkpoints on the HuggingFace Hub, allenai/longformer-large-4096 is the highest-quality variant; allenai/longformer-base-4096 is about 3x faster.
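The classification example pads every input to 4096 tokens, which spends compute on padding for shorter documents. The sketch below estimates the alternative of padding only up to the next multiple of the attention window. It assumes a window of 512 tokens (check model.config.attention_window for the checkpoint in use); padded_length is a hypothetical helper, not part of the transformers API.

```python
import math


def padded_length(num_tokens, attention_window=512):
    """Smallest multiple of attention_window that is >= num_tokens."""
    return math.ceil(num_tokens / attention_window) * attention_window


for n in (700, 1500, 3000):
    dynamic = padded_length(n)
    saved = 4096 - dynamic
    print(f"{n} tokens -> pad to {dynamic} (saves {saved} positions vs. max_length)")
```

With the tokenizer, a similar effect can usually be obtained by passing padding="longest" together with pad_to_multiple_of=512 instead of padding="max_length".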
Document Q&A With LongformerForQuestionAnswering
import torch
from transformers import LongformerForQuestionAnswering, LongformerTokenizerFast

model = LongformerForQuestionAnswering.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
tokenizer = LongformerTokenizerFast.from_pretrained("allenai/longformer-large-4096-finetuned-triviaqa")
model.eval()

question = "What is the main argument in section 3?"
document = "...full document text..."
# Truncate only the document (second sequence) so the question is never cut off
encoding = tokenizer(question, document, return_tensors="pt", max_length=4096, truncation="only_second")

# Global attention on question tokens (sequence id 0); special tokens have id None
sequence_ids = encoding.sequence_ids(0)
global_attention = [1 if t == 0 else 0 for t in sequence_ids]
encoding["global_attention_mask"] = torch.tensor([global_attention])

with torch.no_grad():
    outputs = model(**encoding)

# Independent argmax can yield end < start, in which case the decoded answer is empty
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
answer = tokenizer.decode(encoding["input_ids"][0][start:end + 1])
print(answer)
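Taking the argmax of the start and end logits independently can produce an end position that precedes the start. A common remedy is to search for the best-scoring valid span instead. The following is a minimal sketch in plain Python; best_span and max_answer_len are hypothetical names, and the logits are made-up numbers chosen to show the failure case.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Illustrative logits: independent argmax gives start=2, end=0 (invalid)
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [4.0, 0.1, 0.2, 3.0, 0.5]
print(best_span(start_logits, end_logits))  # (2, 3)
```

With the model outputs from the example above, the inputs would be outputs.start_logits[0].tolist() and outputs.end_logits[0].tolist(). This sketch does not restrict the search to document tokens; a fuller version would also skip question and special-token positions.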
Comparison to Big Bird
Big Bird (Google) solves the same long-sequence problem with a different attention pattern: random attention + sliding window + global. Both achieve similar accuracy on long-document benchmarks. Longformer is more widely deployed and has better HuggingFace ecosystem support; Big Bird's random attention may provide marginal advantages on extremely long documents (>4096 tokens with extended variants).
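For comparison, the three-part pattern described above can be sketched as a toy mask. This is a simplified token-level illustration; the actual Big Bird implementation works on blocks of tokens rather than single positions. bigbird_style_mask and its parameters are hypothetical names, and the fixed seed only makes the example repeatable.

```python
import random


def bigbird_style_mask(n, one_sided_window, global_positions, num_random, seed=0):
    """Boolean n x n mask combining sliding-window, global, and random attention."""
    rng = random.Random(seed)
    global_set = set(global_positions)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) <= one_sided_window
            is_global = i in global_set or j in global_set
            mask[i][j] = local or is_global
        # Each query additionally attends to a few randomly chosen keys
        for j in rng.sample(range(n), num_random):
            mask[i][j] = True
    return mask


mask = bigbird_style_mask(n=16, one_sided_window=1, global_positions=[0], num_random=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting num_random=0 reduces this to the window-plus-global pattern, which makes the structural difference between the two models easy to see side by side.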