Embedding models convert text into numerical vectors that capture semantic meaning, enabling semantic search, similarity ranking, and retrieval-augmented generation (RAG). For most RAG and semantic search use cases, open source embedding models now match or come close to OpenAI's text-embedding-3-small model on standard benchmarks, at zero API cost when run locally. The best open source embedding models for general English text are all-MiniLM-L6-v2 for speed-sensitive applications, all-mpnet-base-v2 for higher quality at moderate latency, and BAAI/bge-m3 for state-of-the-art quality and multilingual support. The right choice depends on your latency requirements, language needs, and whether you are running locally or via API.
Here is the complete comparison.
How Embedding Models Work
An embedding model takes a string of text as input and returns a vector (an array of floating-point numbers) as output. The vector's dimensions encode semantic information: two texts with similar meaning will produce vectors that are close in the vector space (high cosine similarity). Two texts with different meanings will produce vectors that are far apart.
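As a minimal sketch of the "close in vector space" idea, cosine similarity can be computed in plain Python. The three-dimensional vectors below are made-up toy values, not the output of any real model; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (illustrative only, not real model output)
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should score higher than unrelated ones
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```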
Vector dimensionality varies by model: all-MiniLM-L6-v2 produces 384-dimensional vectors, all-mpnet-base-v2 produces 768-dimensional, text-embedding-3-small produces 1536-dimensional, and BGE-M3 produces 1024-dimensional. Higher dimensionality does not always mean better quality, but it does mean more storage and slower similarity search.
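To make the storage trade-off concrete, here is a rough back-of-the-envelope calculation. It assumes vectors are stored as uncompressed float32 values (4 bytes per dimension) and ignores any index overhead a vector database adds, so treat the numbers as a lower bound.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage

DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "BAAI/bge-m3": 1024,
    "text-embedding-3-small": 1536,
}

def storage_bytes(num_vectors, dims):
    """Raw size of num_vectors float32 vectors with dims dimensions each."""
    return num_vectors * dims * BYTES_PER_FLOAT32

for name, dims in DIMENSIONS.items():
    gigabytes = storage_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gigabytes:.2f} GB per 1M vectors")
```

Under these assumptions, one million 384-dimensional vectors take about 1.5 GB, and 1536-dimensional vectors take four times that.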
The primary use case in AI applications: you embed your documents and store the vectors in a vector database (Chroma, Pinecone, Weaviate, pgvector). At query time, you embed the user's query and find the most similar document vectors. Those documents are the context you pass to the LLM. This is the core of RAG.
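The retrieval step can be sketched as a brute-force nearest-neighbor search. This is an illustration only: the document IDs and vectors are hypothetical toy values, and in a real pipeline the vectors would come from an embedding model and the search would be handled by the vector database.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, cosine score) pairs for the k most similar documents."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical pre-computed document vectors (toy values)
index = {
    "refund-policy": [0.9, 0.1, 0.1],
    "shipping-times": [0.2, 0.9, 0.1],
    "api-reference": [0.1, 0.1, 0.9],
}
query = [0.8, 0.3, 0.0]  # stand-in for the embedded user query

hits = top_k(query, index, k=2)
print([doc_id for doc_id, _ in hits])  # ['refund-policy', 'shipping-times']
```

The text of the returned documents is what would be passed to the LLM as context.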
The sentence-transformers Library
sentence-transformers (GitHub: UKPLab/sentence-transformers, 16k+ stars) is the standard Python library for running open source embedding models locally. It wraps Hugging Face Transformers with a simpler API for generating sentence embeddings.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape) # (2, 384)
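The library handles tokenization and batching automatically. Whether the output vectors are L2-normalized depends on the model's configuration; pass normalize_embeddings=True to encode() when you need unit-length vectors. For most use cases, this is all you need.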
Model Comparison
all-MiniLM-L6-v2
- Dimensions: 384
- Speed: Very fast (6-layer transformer, small)
- Quality: Good for general semantic similarity
- Use case: Speed-sensitive applications, large batch processing, low-memory environments
- MTEB score: ~56 (MTEB is the standard embedding benchmark)
- Inference speed on CPU: ~14,000 sentences/second on modern hardware
- Memory footprint: ~90MB model size
all-mpnet-base-v2
- Dimensions: 768
- Speed: Moderate (12-layer transformer)
- Quality: Better than MiniLM for most tasks
- Use case: When you need higher quality and can afford slightly more latency
- MTEB score: ~57.8
- Inference speed on CPU: ~4,000 sentences/second
- Memory footprint: ~420MB model size
BAAI/bge-m3
- Dimensions: 1024
- Speed: Slower (large model)
- Quality: State of the art across 100+ languages
- Use case: Multilingual applications, production RAG where quality matters most
- MTEB score: ~62.6 (as of early 2026)
- Supports dense retrieval, sparse retrieval, and multi-vector retrieval from a single model
- Memory footprint: ~2.3GB model size
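To illustrate how dense and sparse retrieval signals might be combined, here is a minimal sketch. It is not the BGE-M3 API: the function names, the toy vectors and token weights, and the 0.7 mixing weight are all assumptions made for illustration. The sparse score here sums the product of weights over tokens that appear in both query and document.

```python
def dense_score(q, d):
    """Dot product of two dense vectors (cosine similarity if both are unit length)."""
    return sum(a * b for a, b in zip(q, d))

def sparse_score(q_weights, d_weights):
    """Sum of weight products over tokens shared by query and document."""
    shared = q_weights.keys() & d_weights.keys()
    return sum(q_weights[t] * d_weights[t] for t in shared)

def hybrid_score(dense, sparse, dense_weight=0.7):
    """Weighted mix of the two scores; dense_weight is a hypothetical tuning knob."""
    return dense_weight * dense + (1 - dense_weight) * sparse

# Toy inputs (illustrative only)
q_dense, d_dense = [0.6, 0.8], [0.8, 0.6]
q_sparse = {"refund": 0.5, "policy": 0.25}
d_sparse = {"refund": 0.5, "window": 0.25}

dense = dense_score(q_dense, d_dense)
sparse = sparse_score(q_sparse, d_sparse)
print(round(hybrid_score(dense, sparse), 3))
```

In practice the mixing weight would be tuned on a validation set for your own data.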
Nomic Embed Text v1.5
- Dimensions: 768
- Speed: Moderate
- Quality: Competitive with bge-large while being smaller
- Use case: Good balance of quality and speed for English applications
- MTEB score: ~62.4
- Fully open source (Apache 2.0) with released training data
Comparison with OpenAI text-embedding-3-small
OpenAI's text-embedding-3-small:
- Dimensions: 1536 (can be truncated to 256, 512, or 1024)
- Speed: Fast (API, no local compute)
- Quality: MTEB score ~62.3
- Cost: $0.02 per 1M tokens
- No local deployment required
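Shortening an embedding can be sketched as keeping the leading components and rescaling to unit length. This is a sketch under the assumption that the model was trained to support shortened embeddings; for OpenAI's models the API's dimensions parameter does this server-side, and the toy four-dimensional vector below merely stands in for a real 1536-dimensional one.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first dims components and rescale the result to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector (stand-in for a real embedding)
short = truncate_and_renormalize(full, 2)

print(len(short))                 # 2
print(sum(x * x for x in short))  # approximately 1.0
```

Rescaling matters because a truncated vector is no longer unit length, which would distort dot-product similarity scores.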
For English-only RAG applications:
- bge-m3 and nomic-embed-text-v1.5 are approximately comparable to text-embedding-3-small on MTEB
- Local models have zero marginal cost but require GPU or CPU compute
- API models have per-call cost but no infrastructure management
When open source embeddings are good enough:
- Any application where MTEB ~60+ is sufficient (most RAG applications)
- Multilingual applications (bge-m3 outperforms text-embedding-3-small for non-English text)
- High-volume applications where per-call API cost is significant
- Privacy-sensitive applications where sending data to external APIs is not acceptable
When OpenAI embeddings are worth it:
- You want zero infrastructure management
- Your embedding volume is low (at $0.02 per 1M tokens, 10M tokens/month costs about $0.20, and the bill only reaches $200/month at around 10B tokens)
- You need the latest embedding model without evaluating and deploying open source alternatives
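The API side of this trade-off is simple arithmetic. The sketch below uses the $0.02 per 1M tokens price quoted above; the monthly volumes are arbitrary examples.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above

def monthly_api_cost(tokens_per_month):
    """API cost in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Example volumes (arbitrary): 10M, 1B, and 10B tokens per month
for tokens in (10_000_000, 1_000_000_000, 10_000_000_000):
    print(f"{tokens:>14,} tokens/month -> ${monthly_api_cost(tokens):,.2f}")
```

At this price, API cost stays small until volume reaches billions of tokens per month, so for low volumes the deciding factors are usually privacy and infrastructure rather than cost.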
Running Embeddings at Scale
For production embedding workloads, batch encoding is critical for throughput:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Batch encode for efficiency
documents = ["doc1 text...", "doc2 text..."]  # in practice, thousands of documents
batch_size = 32
embeddings = model.encode(
    documents,
    batch_size=batch_size,
    show_progress_bar=True,
    normalize_embeddings=True,  # unit-length vectors, so dot product equals cosine similarity
)
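When the corpus is too large to hold every embedding in memory at once, one option is to feed the encoder one chunk at a time and write each chunk's vectors to storage before moving on. The helper below is a hypothetical sketch of that chunking step only; chunked is not part of sentence-transformers.

```python
def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

docs = [f"doc {i}" for i in range(70)]
print([len(batch) for batch in chunked(docs, 32)])  # [32, 32, 6]
```

Each yielded chunk could then be passed to the encoder and persisted separately.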
For GPU acceleration, sentence-transformers automatically uses CUDA if available. A single A10G GPU (available on Hugging Face Spaces or cloud providers for ~$0.60/hour) can encode 100,000-500,000 sentences per hour depending on model and sentence length.
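Using only the figures quoted above (about $0.60/hour and 100,000 to 500,000 sentences per hour), a rough time and cost estimate for a one-off encoding job looks like this. The 5 million sentence corpus size is an arbitrary example, and real throughput depends on model and sentence length.

```python
GPU_PRICE_PER_HOUR = 0.60              # USD, A10G figure quoted above
THROUGHPUT_RANGE = (100_000, 500_000)  # sentences per hour, range quoted above

def encode_estimate(num_sentences, sentences_per_hour):
    """Return (hours, cost in USD) to encode num_sentences at the given rate."""
    hours = num_sentences / sentences_per_hour
    return hours, hours * GPU_PRICE_PER_HOUR

# Example corpus size (arbitrary): 5 million sentences
for rate in THROUGHPUT_RANGE:
    hours, cost = encode_estimate(5_000_000, rate)
    print(f"{rate:,} sentences/hour: {hours:.0f} h, ${cost:.2f}")
```

Under these assumptions, the job takes between 10 and 50 GPU hours and costs between $6 and $30.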