Why Sentence Transformers
Raw BERT produces token embeddings, not sentence embeddings. Averaging BERT's token outputs without further training gives poor semantic representations: sentences with similar meanings can end up with dissimilar vectors. Sentence Transformers (SBERT) fixes this by fine-tuning BERT-style models in a siamese setup on natural language inference pairs, producing embeddings whose cosine similarity tracks semantic similarity.
The Hugging Face Sentence Transformers collection hosts 200+ pre-trained models covering different size/quality tradeoffs.
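To make the pooling step concrete, here is a minimal sketch of masked mean pooling, the operation that collapses per-token vectors into a single sentence vector. The function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions, not part of any library. SBERT-style models typically apply the same kind of pooling, but on top of an encoder that has been fine-tuned so the averaged vector is meaningful.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy "token embeddings"; the last row stands in for a padding token.
tokens = [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 0]

sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 1.0]
```

The mask matters: without it, padding tokens would drag the average toward whatever values the padding positions happen to hold.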
Encoding and Cosine Similarity
from sentence_transformers import SentenceTransformer, util
# Small, fast general-purpose model with 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "The stock market closed higher today",
]
# Encode all sentences in one batch; returns a torch tensor with one row per sentence
embeddings = model.encode(sentences, convert_to_tensor=True)
# Pairwise cosine similarity
cos_sim = util.cos_sim(embeddings, embeddings)
print(f"Sentences 0 and 1 similarity: {cos_sim[0][1]:.4f}") # ~0.72 (semantically similar)
print(f"Sentences 0 and 2 similarity: {cos_sim[0][2]:.4f}") # ~0.05 (unrelated)
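For readers who want to see what the similarity matrix contains, the following is a dependency-free sketch of pairwise cosine similarity: normalize each vector to unit length, then take dot products. The helper name `cos_sim_matrix` and the toy vectors are assumptions for illustration; the library's `util.cos_sim` operates on tensors and is what you would use in practice.

```python
import math
from typing import List, Sequence


def cos_sim_matrix(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the matrix whose entry [i][j] is the cosine of a[i] and b[j]."""

    def unit(v: Sequence[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in v]

    a_unit = [unit(v) for v in a]
    b_unit = [unit(v) for v in b]
    # Dot product of unit vectors equals the cosine of the angle between them.
    return [[sum(x * y for x, y in zip(u, w)) for w in b_unit] for u in a_unit]


# Toy vectors: the first two point in similar directions, the first and third are orthogonal.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sims = cos_sim_matrix(vectors, vectors)
print(f"{sims[0][1]:.4f}")  # 0.7071
print(f"{sims[0][2]:.4f}")  # 0.0000
```

Because only direction matters, scaling a vector by any positive constant leaves its cosine similarity to every other vector unchanged.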
Semantic Search With util.semantic_search
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-mpnet-base-v2")
corpus = [
    "Python is a high-level programming language",
    "Machine learning requires large datasets",
    "Neural networks are inspired by the human brain",
    "Flask is a lightweight web framework",
]
# Embed the corpus once; the embeddings can be reused for every query
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query = "web development frameworks"
query_embedding = model.encode(query, convert_to_tensor=True)
# Returns one list of hits per query, each hit a dict with 'corpus_id' and 'score'
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)
for hit in hits[0]:
    print(f"Score: {hit['score']:.4f} | {corpus[hit['corpus_id']]}")
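Conceptually, the search above is a top-k selection over cosine scores. The sketch below shows that idea in plain Python under simplifying assumptions: a single query, small in-memory lists, and hypothetical helper names (`cosine`, `top_k_search`). It mirrors the `corpus_id`/`score` shape of the hits but is not the library's implementation.

```python
import heapq
import math
from typing import Dict, List, Sequence, Union


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def top_k_search(
    query_vec: Sequence[float],
    corpus_vecs: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Dict[str, Union[int, float]]]:
    """Score every corpus vector against the query and keep the best top_k."""
    scored = [
        {"corpus_id": i, "score": cosine(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    # nlargest returns results sorted from highest to lowest score.
    return heapq.nlargest(top_k, scored, key=lambda hit: hit["score"])


# Toy 2-dimensional "embeddings" standing in for real model output.
toy_corpus = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.0, 1.0]

toy_hits = top_k_search(toy_query, toy_corpus, top_k=2)
for hit in toy_hits:
    print(f"corpus_id={hit['corpus_id']} score={hit['score']:.2f}")
```

A brute-force scan like this is linear in corpus size, which is usually fine for small corpora; very large ones are commonly served from an approximate nearest-neighbor index instead.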
Fine-Tuning on Custom Pairs
Use MultipleNegativesRankingLoss when you have (anchor, positive) pairs without explicit negatives. For each anchor, the positives of the other pairs in the batch serve as negatives:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
model = SentenceTransformer("all-MiniLM-L6-v2")
# Each example is an (anchor, positive) pair; real training needs far more than two pairs
train_examples = [
    InputExample(texts=["What is machine learning?", "ML is a type of AI that learns from data"]),
    InputExample(texts=["How do I fix a bug?", "Debugging requires isolating the failing component"]),
]
# Larger batches supply more in-batch negatives per anchor
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
# Keep warmup_steps below the total number of training steps, or the learning rate never reaches its target
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
)
model.save("my-finetuned-model")
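The in-batch-negatives idea can be written out as a small calculation: for each anchor, score it against every positive in the batch and apply cross-entropy with the matching positive as the correct class. The sketch below is a plain-Python illustration, not the library's code. The names `cosine` and `mnr_loss` are hypothetical, and the scale of 20 is assumed to match the library's default.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def mnr_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    scale: float = 20.0,
) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    per_anchor = []
    for i, anchor in enumerate(anchors):
        # Every positive in the batch is a candidate; all but index i act as negatives.
        logits = [scale * cosine(anchor, pos) for pos in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        per_anchor.append(log_sum_exp - logits[i])
    return sum(per_anchor) / len(per_anchor)


# Toy embeddings: in the aligned batch each anchor matches its own positive.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(toy_anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss(toy_anchors, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned:.6f}, swapped: {swapped:.6f}")
```

The loss is near zero when each anchor is closest to its own positive and large when it is closer to someone else's, which is why duplicate or near-duplicate positives within a batch can hurt training.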
Best Models Comparison
| Model | Dimensions | Speed | Quality |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Very fast | Good |
| all-mpnet-base-v2 | 768 | Fast | Better |
| multi-qa-mpnet-base-dot-v1 | 768 | Fast | Best for QA |
| BGE-M3 | 1024 | Moderate | Best overall |
The GitHub repository includes pretrained model benchmarks on STS, QA, and retrieval tasks. For most production RAG use cases, all-mpnet-base-v2 is the baseline to beat before reaching for larger models.
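Embedding dimension also drives storage cost, which is worth estimating before picking a larger model. The following back-of-the-envelope sketch assumes raw float32 storage (4 bytes per value) with no index overhead or compression; the helper name `index_size_mb` is hypothetical.

```python
def index_size_mb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw storage in megabytes (10^6 bytes) for num_vectors embeddings of size dim."""
    return num_vectors * dim * bytes_per_value / 1_000_000


# Dimensions taken from the comparison table above.
model_dims = [
    ("all-MiniLM-L6-v2", 384),
    ("all-mpnet-base-v2", 768),
    ("BGE-M3", 1024),
]

for name, dim in model_dims:
    print(f"{name}: {index_size_mb(1_000_000, dim):,.0f} MB per million vectors")
# all-MiniLM-L6-v2: 1,536 MB per million vectors
# all-mpnet-base-v2: 3,072 MB per million vectors
# BGE-M3: 4,096 MB per million vectors
```

Real vector stores add their own overhead and may quantize vectors, so treat these figures as a rough baseline rather than a sizing guarantee.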