BERT (Bidirectional Encoder Representations from Transformers) is a pretrained language model released by Google in 2018. It changed NLP by conditioning each token's representation on the text to both its left and its right, which gave it a richer use of context than the left-to-right models that preceded it. For developers building text classification, search, or question answering features, BERT is still a practical and efficient choice in 2026.
What BERT Actually Is
Before BERT, most language models read text left to right (or occasionally right to left in a separate pass). BERT reads the entire sequence at once and builds a representation of each word that considers everything to its left AND everything to its right. That is the "bidirectional" in the name.
Google trained BERT using two tasks:
Masked Language Modeling (MLM): 15% of input tokens are randomly selected for prediction, and most of the selected tokens are replaced with a [MASK] token. The model must predict the original tokens using the surrounding context. This is why BERT can understand context from both directions -- it has to look in both directions to fill in the blank.
Next Sentence Prediction (NSP): Given two sentences, predict whether the second sentence follows the first in the original text. This task was intended to help BERT model relationships between sentences, which matters for question answering and natural language inference. Later work such as RoBERTa dropped NSP without hurting downstream results, so MLM is the objective that matters most.
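The MLM corruption step described above can be sketched in a few lines. This is a minimal illustration, not BERT's actual preprocessing code: the function name `mask_tokens`, its parameters, and the toy vocabulary are assumptions made for this example, and it works on whitespace-split words rather than WordPiece tokens. The 80/10/10 split among selected positions follows the procedure reported in the original BERT paper.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected, so there is no prediction target
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return corrupted, labels

tokens = "the cat sat on the mat".split()
# select_prob is raised to 0.5 here only so that a six-word toy input is
# likely to contain at least one selected position.
corrupted, labels = mask_tokens(tokens, vocab=tokens, select_prob=0.5, seed=0)
print(corrupted)
print(labels)
```

The training loss is computed only at positions where `labels` is not `None`; every other position is context.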
BERT was pretrained on English Wikipedia (about 2,500M words) and the BooksCorpus (about 800M words), roughly 3.3B words combined. The base model has 110M parameters. BERT Large has 340M parameters.
How BERT Differs from GPT
This distinction trips up many developers. Both are transformer-based models. The fundamental difference is in architecture and training objective.
GPT is decoder-only. It uses a causal (left-to-right) attention mask so that each token can only attend to itself and the tokens before it. This makes GPT naturally suited for text generation -- you predict the next token, append it, then predict the next token again.
BERT is encoder-only. It uses bidirectional attention and is trained to fill in masked tokens, not to predict what comes next. This makes BERT excellent at building rich representations of existing text.
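The two attention patterns can be pictured as boolean matrices. The sketch below is only an illustration with hypothetical helper names (`causal_mask`, `bidirectional_mask`, `show`); real implementations build these masks as tensors and also mask out padding positions.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every token may attend to every other token in the sequence."""
    return [[True] * n for _ in range(n)]

def show(mask):
    """Render a mask as rows of 1s and 0s for printing."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)

print("GPT-style (causal):")
print(show(causal_mask(4)))
print("BERT-style (bidirectional):")
print(show(bidirectional_mask(4)))
```

Row i lists what token i can see. The causal mask is lower-triangular, which is what makes next-token prediction a fair task; the bidirectional mask is fully populated, which is why BERT needs masked tokens as its training signal instead.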
The practical implication: you would not use BERT to generate a paragraph. You would use BERT to understand, classify, or compare paragraphs. For generation tasks, GPT-style models are the right tool. For understanding tasks -- classification, entity extraction, semantic similarity -- BERT-family models are often the better choice.
What BERT Is Actually Good At
Text classification: Sentiment analysis, topic labeling, spam detection. You add a classification head on top of the [CLS] token representation and fine-tune.
Named Entity Recognition (NER): Identify people, organizations, locations, dates in text. BERT produces a representation for every token (strictly, every WordPiece subword), making it natural for sequence labeling.
Question Answering: Given a passage and a question, predict the start and end token positions of the answer span in the passage. BERT-based QA models (like those fine-tuned on SQuAD) work very well for extractive question answering.
Semantic Similarity: Compare two sentences and determine how similar they are in meaning. Sentence-BERT (SBERT) adapts BERT to produce sentence-level embeddings that can be compared with cosine similarity.
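For the extractive question answering case above, the model emits one start score and one end score per passage token, and decoding picks the best valid span. The following is a minimal sketch of that decoding step. The function name `best_span`, the `max_answer_len` default, and the logits are all made up for illustration, and the sketch ignores question tokens and the "no answer" case that real QA pipelines handle.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e and a span of at most max_answer_len tokens."""
    best = None
    for s, s_score in enumerate(start_logits):
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_score + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical passage tokens and logits, chosen by hand for this example.
tokens = ["BERT", "was", "released", "by", "Google", "in", "2018"]
start_logits = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 4.0]
end_logits = [0.0, 0.1, 0.1, 0.0, 0.5, 0.1, 3.5]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

The `s <= e` constraint matters: taking the argmax of each logit vector independently can yield an end position before the start position.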
Using BERT in a Few Lines of Python
Hugging Face Transformers makes this straightforward:
from transformers import pipeline
# DistilBERT fine-tuned on SST-2 for binary sentiment classification
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("This product is genuinely excellent.")
print(result)  # a list such as [{'label': 'POSITIVE', 'score': 0.9998}]; the exact score may vary
For embeddings:
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
inputs = tokenizer("Hello, world!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# CLS token embedding -- shape: (1, 768)
embedding = outputs.last_hidden_state[:, 0, :]
Fine-Tuning vs Using as an Embedding Model
You have two modes of using BERT:
Fine-tuning: Add a task-specific head (classification, NER, QA) and continue training all weights on your labeled data. You get a model that is specialized for your task. Requires labeled training data. A few thousand examples is often enough for a classification task.
Embedding model (frozen BERT): Run your text through BERT and take the [CLS] token output (or mean pool all token outputs) as a fixed-size vector. Use these vectors as features in a downstream model (logistic regression, k-NN, etc.) or for similarity search. Similarity search needs no labeled data at all; a downstream classifier still needs labels, but it trains quickly because BERT's weights stay frozen. Lower performance ceiling but far faster to deploy.
For most production use cases where you have labeled data, fine-tuning will outperform frozen embeddings substantially.
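For the frozen-embedding mode, the two operations that come up most are mean pooling and cosine similarity. Here is a dependency-free sketch using plain lists. The helper names and the 3-dimensional toy vectors are assumptions made for this example; real BERT-base vectors have 768 dimensions and would normally be handled as tensors.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where attention_mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy inputs: the third row of the first sentence is padding and must be ignored.
sent_a = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [9.0, 9.0, 9.0]], [1, 1, 0])
sent_b = mean_pool([[1.0, 1.0, 0.0]], [1])
similarity = cosine_similarity(sent_a, sent_b)
print(sent_a, similarity)
```

With the embedding snippet shown earlier, the token vectors correspond to `outputs.last_hidden_state[0]` and the mask to `inputs["attention_mask"][0]`. Skipping padded positions is the detail that is easiest to miss when batching sentences of different lengths.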
DistilBERT: 40% Smaller, 97% of the Performance
DistilBERT (Sanh et al., 2019) is a smaller version of BERT trained with knowledge distillation, a technique in which a smaller "student" model is trained to mimic the outputs of a larger "teacher" model.
Results from the paper:
- 40% fewer parameters than BERT-base
- 60% faster inference
- Retains 97% of BERT-base's performance on the GLUE benchmark
For many applications, DistilBERT is a sensible default. The 3% performance difference rarely matters in practice, and the 60% speed gain is meaningful in production.
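The "student mimics teacher" idea behind knowledge distillation can be made concrete with a soft-target loss. The sketch below shows only that generic term, under assumed names and an assumed temperature of 2.0. It is not DistilBERT's full training objective, which the paper describes as combining a distillation loss with a masked language modeling loss and a cosine embedding loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# Hypothetical logits over three classes.
teacher_logits = [4.0, 1.0, 0.5]
close = distillation_loss([3.8, 1.1, 0.4], teacher_logits)
far = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
print(close, far)
```

A student whose logits track the teacher's gets a lower loss than one that ranks the classes differently. The soft targets carry more information than a single hard label because they also encode how the teacher ranks the wrong answers.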
Other variants worth knowing: ALBERT (shares parameters across layers and factorizes the embedding matrix to cut model size), RoBERTa (BERT retrained on more data with longer training, dynamic masking, and no NSP task -- often outperforms BERT), DeBERTa (adds disentangled attention and set state-of-the-art results on benchmarks such as SuperGLUE at release).
When to Use BERT vs a Modern LLM API
Use BERT (or a BERT-family model) when:
- You have labeled training data and want a specialized, fast classifier
- You need to run inference on-premise (no API calls)
- Latency is critical (BERT inference on a GPU typically takes milliseconds, while a hosted LLM call often takes hundreds of milliseconds to seconds)
- Cost is a constraint (running your own model vs paying per-token API costs)
- Your task is well-defined (classification, NER, extractive QA)
Use a modern LLM API (GPT-4, Claude, Gemini) when:
- Your task requires generation, reasoning, or following complex instructions
- You have little or no labeled training data and need zero-shot performance
- You want to iterate quickly without training infrastructure
- Your task requires world knowledge or multi-step reasoning
The pattern that works well in production: use a fast BERT-family model as a first-pass filter or classifier, and only invoke the expensive LLM API for cases that require it. This keeps costs down and latency manageable.
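The first-pass-filter pattern comes down to a confidence threshold. A minimal sketch, assuming a classifier that returns a label with a confidence score: the function `route`, the 0.9 threshold, and both stub models are hypothetical stand-ins, not a real BERT or LLM client.

```python
def route(text, fast_classifier, llm_fallback, threshold=0.9):
    """Use the fast classifier when it is confident, otherwise defer to the LLM.

    fast_classifier(text) -> (label, confidence in [0, 1])
    llm_fallback(text)    -> label
    Returns (label, which_path) so the routing rate can be monitored.
    """
    label, confidence = fast_classifier(text)
    if confidence >= threshold:
        return label, "fast"
    return llm_fallback(text), "llm"

# Stubs standing in for a fine-tuned BERT-family model and an LLM API call.
def fake_bert(text):
    return ("spam", 0.98) if "free money" in text else ("ham", 0.55)

def fake_llm(text):
    return "ham"

easy = route("free money now", fake_bert, fake_llm)
hard = route("quarterly invoice attached", fake_bert, fake_llm)
print(easy, hard)
```

The threshold is the tuning knob: raising it sends more traffic to the LLM and raises cost, lowering it trusts the small model more. It is worth choosing it on a held-out set rather than by guesswork, since classifier confidence scores are not always well calibrated.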
What Developers Often Get Wrong
The [CLS] token is not always the best sentence representation. For sentence similarity tasks, mean pooling over all token embeddings (or using Sentence-BERT) typically outperforms taking just the [CLS] vector.
BERT has a 512-token limit, and that budget includes the special [CLS] and [SEP] tokens. Texts longer than that must be truncated or split into chunks before passing to BERT. This is a real operational concern for document-level classification.
Casing depends on the checkpoint. "bert-base-cased" preserves case, while "bert-base-uncased" lowercases the input and strips accents before tokenizing. When capitalization carries signal, as with code or proper nouns, the cased model is usually better. For general text, uncased is fine.
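For the 512-token limit mentioned above, a common workaround is to split a long document into overlapping windows and classify each one. The sketch below operates on a list of token ids that is assumed to be already tokenized without special tokens. The function name, the stride of 64, and the ids 101 and 102 (the [CLS] and [SEP] ids in the bert-base-uncased vocabulary) are assumptions for this example and should be checked against the tokenizer actually in use.

```python
def chunk_token_ids(token_ids, max_len=512, stride=64, cls_id=101, sep_id=102):
    """Split token_ids into overlapping windows that fit max_len,
    counting the [CLS] and [SEP] tokens added to each window."""
    body_len = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 <= stride < body_len:
        raise ValueError("stride must be non-negative and smaller than max_len - 2")
    step = body_len - stride  # consecutive windows share `stride` tokens
    chunks = []
    start = 0
    while True:
        window = token_ids[start:start + body_len]
        chunks.append([cls_id] + window + [sep_id])
        if start + body_len >= len(token_ids):
            break  # this window reached the end of the document
        start += step
    return chunks

# 1200 hypothetical token ids standing in for a long document.
chunks = chunk_token_ids(list(range(1000, 2200)))
print(len(chunks), [len(c) for c in chunks])
```

The overlap reduces the chance that a decisive phrase is cut in half at a window boundary. Per-chunk predictions still have to be combined, for example by averaging scores or taking the maximum. Hugging Face tokenizers offer comparable windowing through the `stride` and `return_overflowing_tokens` arguments.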