Three Retrieval Modes in One Model
Most embedding models produce a single dense vector per text. BGE-M3 produces three types of representations simultaneously, each useful for different retrieval scenarios:
- Dense retrieval: Single 1024-dimension vector, standard cosine similarity, fast ANN search
- Sparse retrieval: Weighted term importance scores (like BM25 but learned), exact keyword matching advantage
- Multi-vector (ColBERT): Token-level embeddings for late interaction scoring, highest accuracy but more compute
The BGE-M3 paper reports that combining all three (hybrid search) outperforms any single mode on the BEIR benchmark by 2-5 points of nDCG@10.
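To make the three modes concrete, here is a minimal pure-Python sketch of how each one could score a query against a document. It is an illustration under assumptions, not the library's implementation: the function names and the toy 2-dimensional vectors are made up for this example, and the multi-vector score is the usual late-interaction "MaxSim" averaged over query tokens.

```python
import math

def cosine(u, v):
    """Dense mode: cosine similarity between two single vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def lexical_overlap(query_weights, doc_weights):
    """Sparse mode: sum of weight products over terms present in both texts."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

def max_sim(query_tokens, doc_tokens):
    """Multi-vector mode: best-matching document token per query token, averaged."""
    best = [max(cosine(q, d) for d in doc_tokens) for q in query_tokens]
    return sum(best) / len(best)

# Toy inputs (invented values; real dense and token vectors have 1024 dimensions).
dense_score = cosine([1.0, 0.0], [1.0, 1.0])
sparse_score = lexical_overlap(
    {"machine": 0.5, "learning": 0.4},
    {"learning": 0.5, "data": 0.2},
)
colbert_score = max_sim(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
)
print(f"dense={dense_score:.3f} sparse={sparse_score:.3f} multi-vector={colbert_score:.3f}")
```

The extra cost of the multi-vector mode is visible in `max_sim`: it compares every query token with every document token, while the dense mode needs one comparison per document.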
MTEB Benchmark Performance
On the Massive Text Embedding Benchmark (MTEB), BGE-M3's dense retrieval scores competitively with OpenAI's text-embedding-3-large on English tasks and significantly outperforms it on multilingual tasks. The model supports 100+ languages, including Chinese, Japanese, Korean, Arabic, and European languages.
The HuggingFace model page includes detailed benchmark tables per language and task type.
FlagEmbedding Python Library
The FlagEmbedding GitHub repository provides optimized inference:
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,  # Half precision for faster inference
)

sentences = [
    "What is machine learning?",
    "ML is a subset of AI that learns from data.",
    "How do neural networks work?",
]

# Dense embeddings
dense_embeddings = model.encode(
    sentences,
    batch_size=12,
    max_length=8192,
)["dense_vecs"]

# All three modes simultaneously
all_embeddings = model.encode(
    sentences,
    return_dense=True,
    return_sparse=True,
    return_colbert_vecs=True,
)

print(f"Dense shape: {all_embeddings['dense_vecs'].shape}")
print(f"Sparse keys example: {list(all_embeddings['lexical_weights'][0].keys())[:5]}")
```
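Once the dense vectors are available, ranking passages against a query is a matter of comparing rows. The sketch below uses a small stand-in for `dense_vecs` so it runs without downloading the model; the helper name `rank_by_dense` and the 2-dimensional values are hypothetical. It assumes the vectors are L2-normalized (which appears to be the library default), so the dot product equals cosine similarity.

```python
def rank_by_dense(query_vec, passage_vecs):
    """Return (passage_index, score) pairs sorted by descending dot product."""
    scores = [
        sum(float(q) * float(p) for q, p in zip(query_vec, vec))
        for vec in passage_vecs
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Stand-in for model.encode(sentences)["dense_vecs"]: three unit-length rows,
# 2 dimensions instead of 1024, values invented for illustration.
dense_vecs = [[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]]

# Treat the first sentence as the query and the rest as candidate passages.
query_vec, passage_vecs = dense_vecs[0], dense_vecs[1:]
ranking = rank_by_dense(query_vec, passage_vecs)

for index, score in ranking:
    print(f"passage {index}: {score:.2f}")
```

The same function should accept the NumPy array returned by `encode`, since it only iterates over rows; at scale, a matrix product or an ANN index would replace the Python loop.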
Hybrid Search Combination Strategy
For production RAG, combine dense and sparse scores with a weighted sum:
```python
def hybrid_score(dense_score: float, sparse_score: float, alpha: float = 0.5) -> float:
    """
    alpha=1.0: pure dense (semantic)
    alpha=0.0: pure sparse (keyword)
    alpha=0.5: balanced hybrid
    """
    return alpha * dense_score + (1 - alpha) * sparse_score
```
Set alpha higher (0.7-0.8) for semantic queries, lower (0.2-0.3) for keyword-heavy queries like product searches or code lookups.
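Dense and sparse scores usually live on different scales, so a weighted sum can be dominated by whichever mode produces larger numbers. One common workaround, shown here as an assumption rather than something the BGE-M3 authors prescribe, is to min-max normalize each score list across the candidate set before mixing. The helper names and the score values below are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_rank(dense_scores, sparse_scores, alpha=0.5):
    """Return candidate indices ordered by the normalized weighted sum."""
    dense_n = min_max(dense_scores)
    sparse_n = min_max(sparse_scores)
    combined = [alpha * d + (1 - alpha) * s for d, s in zip(dense_n, sparse_n)]
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)

# Invented scores for three candidates: candidate 0 is the closest semantic
# match, candidate 1 has the strongest keyword overlap.
dense_scores = [0.82, 0.79, 0.40]
sparse_scores = [0.05, 0.30, 0.10]

print(hybrid_rank(dense_scores, sparse_scores, alpha=1.0))  # dense only
print(hybrid_rank(dense_scores, sparse_scores, alpha=0.5))  # balanced
```

With these numbers, candidate 0 leads under pure dense scoring and candidate 1 overtakes it once the sparse score gets half the weight, which is the behavior `alpha` is meant to control.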
8192 Token Input Window
The 8192-token limit is a significant practical advantage over models capped at 512 or 2048 tokens. Many documents fit in a single pass without chunking: a research paper abstract plus its full introduction, a complete API documentation page, or a lengthy product description.
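Documents longer than the window still need a fallback. A minimal sketch of one option, overlapping windows over an already tokenized document, follows. The function name and the 256-token overlap are assumptions for illustration; real token counts should come from the model's own tokenizer, not from a stand-in list like the one used here.

```python
def split_into_windows(tokens, max_tokens=8192, overlap=256):
    """Split a token list into windows of at most max_tokens, sharing `overlap` tokens."""
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if len(tokens) <= max_tokens:
        return [tokens]  # fits in one pass, no chunking needed
    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows

# Stand-in token lists: integers instead of real tokenizer output.
short_doc = list(range(3000))
long_doc = list(range(20000))

print(len(split_into_windows(short_doc)))
print([len(w) for w in split_into_windows(long_doc)])
```

Each window can then be embedded separately, with the overlap reducing the chance that a relevant passage is cut in half at a boundary.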
Comparison to OpenAI text-embedding-3-large
| Metric | BGE-M3 (dense) | text-embedding-3-large |
|---|---|---|
| Dimensions | 1024 | 3072 (reducible) |
| Max tokens | 8192 | 8191 |
| Languages | 100+ | ~50 effective |
| Cost | Free (self-hosted) | $0.13/1M tokens |
| Hybrid search | Built-in | External BM25 needed |
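Self-hosting pays off only at high volume. At the $0.13 per 1M tokens listed above, 100M tokens/month costs about $13 through the API, so the break-even volume is the monthly cost of the GPU instance divided by $0.13 per 1M tokens. For a single GPU instance that typically lands in the billions of tokens per month; below that, the case for self-hosting BGE-M3 rests on built-in hybrid search and multilingual coverage rather than price.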
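The payback question comes down to simple arithmetic. A minimal sketch, using the $0.13 per 1M tokens price from the table; the $500 monthly instance cost is a hypothetical placeholder, not a quoted price, and the function names are made up for this example.

```python
def api_cost_usd(tokens_per_month, price_per_million=0.13):
    """Monthly API spend for a given token volume."""
    return tokens_per_month / 1_000_000 * price_per_million

def break_even_tokens(instance_cost_usd, price_per_million=0.13):
    """Monthly token volume at which API spend equals the instance cost."""
    return instance_cost_usd / price_per_million * 1_000_000

monthly_api_cost = api_cost_usd(100_000_000)        # 100M tokens/month
break_even = break_even_tokens(500.0)               # hypothetical $500/month GPU instance

print(f"API cost at 100M tokens/month: ${monthly_api_cost:.2f}")
print(f"Break-even volume: {break_even / 1e9:.2f}B tokens/month")
```

Plugging in an actual instance price and expected volume gives a quick check on whether self-hosting is justified by cost alone.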