Open Source Embedding Models: Which One to Use in 2026

Pristren


Embedding models convert text into numerical vectors that capture semantic meaning, enabling semantic search, similarity ranking, and retrieval-augmented generation (RAG). For most RAG and semantic search use cases, open source embedding models now match or come close to OpenAI's text-embedding-3-small model on standard benchmarks, at zero API cost when run locally. For general English text, the strongest open source options are all-MiniLM-L6-v2 for speed-sensitive applications, all-mpnet-base-v2 for higher quality at moderate latency, and BAAI/bge-m3 for state-of-the-art quality plus multilingual support. The right choice depends on your latency requirements, your language needs, and whether you run the model locally or call it through an API.

Here is the complete comparison.

How Embedding Models Work

An embedding model takes a string of text as input and returns a vector (an array of floating-point numbers) as output. The vector's dimensions encode semantic information: two texts with similar meaning will produce vectors that are close in the vector space (high cosine similarity). Two texts with different meanings will produce vectors that are far apart.
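A minimal sketch of the cosine similarity calculation in plain Python, to make "close in the vector space" concrete. The three toy vectors are hypothetical stand-ins for real model output, which would have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: "cat" and "kitten" point in a similar direction,
# "invoice" points somewhere else.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(round(cosine_similarity(cat, kitten), 3))   # close to 1: similar meaning
print(round(cosine_similarity(cat, invoice), 3))  # close to 0: unrelated
```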

Vector dimensionality varies by model: all-MiniLM-L6-v2 produces 384-dimensional vectors, all-mpnet-base-v2 produces 768-dimensional, text-embedding-3-small produces 1536-dimensional, and BGE-M3 produces 1024-dimensional. Higher dimensionality does not always mean better quality, but it does mean more storage and slower similarity search.
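To make the storage cost concrete, here is a back-of-the-envelope sketch. It assumes each value is stored as a 32-bit float (4 bytes) and counts only the raw vectors, not the index structures or metadata a vector database adds on top.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats


def index_size_bytes(num_vectors: int, dims: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size of the vectors alone, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value


# Dimensions quoted above for each model.
model_dims = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "BAAI/bge-m3": 1024,
    "text-embedding-3-small": 1536,
}

for name, dims in model_dims.items():
    gb = index_size_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {gb:.2f} GB per 1M vectors")
```

Under those assumptions, one million vectors take roughly 1.5 GB at 384 dimensions and about 6.1 GB at 1536 dimensions.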

The primary use case in AI applications: you embed your documents and store the vectors in a vector database (Chroma, Pinecone, Weaviate, pgvector). At query time, you embed the user's query and find the most similar document vectors. Those documents are the context you pass to the LLM. This is the core of RAG.
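The retrieval step can be sketched without any external libraries. This is a brute-force illustration under stated assumptions: the document and query vectors are hypothetical toy values standing in for what model.encode() would return, and a real vector database would replace the linear scan with an index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2):
    """Return (index, score) pairs for the k documents most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# 1. Index time: embed and store documents (toy vectors stand in for model.encode output).
documents = [
    "How to reset your password",
    "Quarterly revenue report",
    "Changing account login credentials",
]
doc_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 0.2, 0.9],
    [0.8, 0.3, 0.1],
]

# 2. Query time: embed the question (hypothetical vector for "I forgot my password").
query_vec = [0.85, 0.2, 0.05]

# 3. Retrieve the closest documents and pass them to the LLM as context.
context = [documents[i] for i, _ in top_k(query_vec, doc_vecs, k=2)]
prompt = "Answer using this context:\n" + "\n".join(context) + "\n\nQuestion: I forgot my password"
print(prompt)
```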

The sentence-transformers Library

sentence-transformers (GitHub: UKPLab/sentence-transformers, 16k+ stars) is the standard Python library for running open source embedding models locally. It wraps Hugging Face Transformers with a simpler API for generating sentence embeddings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

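The library handles tokenization, batching, and pooling automatically. Whether the output vectors are unit-normalized depends on the model, so pass normalize_embeddings=True to model.encode() when you need unit-length vectors for cosine or dot-product search. For most use cases, this is all you need.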
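What normalization means in practice: an L2-normalized vector has length 1, so the dot product of two normalized vectors equals their cosine similarity. The helper below is a plain-Python sketch of that operation, not the library's internal code; in sentence-transformers you would request it with normalize_embeddings=True.

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (the effect of normalize_embeddings=True)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
print(a)          # [0.6, 0.8]
print(dot(a, b))  # equals the cosine similarity of the original vectors
```

This is why inner-product search on normalized embeddings ranks results the same way as cosine similarity.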


Written by Mahmudul Haque Qudrati, CEO & ML Engineer

CEO and ML Engineer at Pristren. Builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.


Model Comparison

all-MiniLM-L6-v2

Dimensions: 384
Speed: Very fast (6-layer transformer, small)
Quality: Good for general semantic similarity
Use case: Speed-sensitive applications, large batch processing, low-memory environments
MTEB score: ~56 (MTEB, the Massive Text Embedding Benchmark, is the standard embedding benchmark)
Inference speed: ~14,200 sentences/second on a V100 GPU (sentence-transformers benchmark); expect much lower throughput on CPU
Memory footprint: ~90MB model size

all-mpnet-base-v2

Dimensions: 768
Speed: Moderate (12-layer transformer)
Quality: Better than MiniLM for most tasks
Use case: When you need higher quality and can afford slightly more latency
MTEB score: ~57.8
Inference speed: ~2,800 sentences/second on a V100 GPU (sentence-transformers benchmark)
Memory footprint: ~420MB model size

BAAI/bge-m3

Dimensions: 1024
Speed: Slower (large model)
Quality: State of the art across 100+ languages
Use case: Multilingual applications, production RAG where quality matters most
MTEB score: ~62.6 (as of early 2026)
Supports dense retrieval, sparse retrieval, and multi-vector retrieval from a single model
Memory footprint: ~2.3GB model size

Nomic Embed Text v1.5

Dimensions: 768
Speed: Moderate
Quality: Competitive with bge-large while being smaller
Use case: Good balance of quality and speed for English applications
MTEB score: ~62.4
Fully open source (Apache 2.0) with released training data
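One way to turn the comparison above into a decision is a small lookup over the quoted figures. This is an illustrative sketch only: the scores and sizes are the approximate values from this article, the Nomic size is left unknown because it is not listed here, and the pick_model helper is hypothetical. Benchmark on your own data before committing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelSpec:
    name: str
    dims: int
    approx_mteb: float        # approximate MTEB scores quoted in this article
    size_mb: Optional[float]  # None where the article gives no figure
    multilingual: bool


MODELS = [
    ModelSpec("all-MiniLM-L6-v2", 384, 56.0, 90, False),
    ModelSpec("all-mpnet-base-v2", 768, 57.8, 420, False),
    ModelSpec("BAAI/bge-m3", 1024, 62.6, 2300, True),
    ModelSpec("nomic-ai/nomic-embed-text-v1.5", 768, 62.4, None, False),
]


def pick_model(max_size_mb: float, need_multilingual: bool = False) -> Optional[ModelSpec]:
    """Highest approximate MTEB score among models whose known size fits the budget."""
    candidates = [
        m for m in MODELS
        if m.size_mb is not None
        and m.size_mb <= max_size_mb
        and (m.multilingual or not need_multilingual)
    ]
    return max(candidates, key=lambda m: m.approx_mteb, default=None)


print(pick_model(max_size_mb=500).name)  # all-mpnet-base-v2
```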

Comparison with OpenAI text-embedding-3-small

OpenAI's text-embedding-3-small:

Dimensions: 1536 (can be shortened to fewer dimensions, such as 512, via the API's dimensions parameter)
Speed: Fast (API, no local compute)
Quality: MTEB score ~62.3
Cost: $0.02 per 1M tokens
No local deployment required
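If you already have full-length vectors and want smaller ones, OpenAI's documentation recommends shortening through the dimensions parameter at request time, and notes that embeddings shortened manually afterwards must be re-normalized. A plain-Python sketch of the manual approach, using a toy 4-dimensional vector in place of a real 1536-dimensional one:

```python
import math


def shorten_embedding(vec: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values, then re-normalize to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector (a real one would have 1536 values)
short = shorten_embedding(full, 2)
print(short)  # two values of about 0.7071, again unit length
```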

For English-only RAG applications:

bge-m3 and nomic-embed-text-v1.5 are approximately comparable to text-embedding-3-small on MTEB
Local models have zero marginal cost but require GPU or CPU compute
API models have per-call cost but no infrastructure management
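The per-call side of that tradeoff is simple arithmetic. A sketch using the $0.02 per 1M tokens price quoted above (prices change, so treat the constant as an assumption):

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above


def api_embedding_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """API cost in USD for embedding `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million


for monthly_tokens in (10_000_000, 1_000_000_000, 10_000_000_000):
    cost = api_embedding_cost(monthly_tokens)
    print(f"{monthly_tokens:>14,} tokens/month -> ${cost:,.2f}")
```

At that rate, 10M tokens cost about $0.20 and 10B tokens about $200 per month, which is the figure to weigh against the compute you would pay for to run a local model.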

When open source embeddings are good enough:

Any application where MTEB ~60+ is sufficient (most RAG applications)
Multilingual applications (bge-m3 outperforms text-embedding-3-small for non-English text)
High-volume applications where per-call API cost is significant
Privacy-sensitive applications where sending data to external APIs is not acceptable

When OpenAI embeddings are worth it:

You want zero infrastructure management
Your embedding volume is low (for example, under 10M tokens/month, which costs about $0.20 at $0.02 per 1M tokens)
You need the latest embedding model without evaluating and deploying open source alternatives

Running Embeddings at Scale

For production embedding workloads, batch encoding is critical for throughput:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Batch encode for efficiency
documents = ["doc1 text...", "doc2 text..."]  # placeholder; in practice, thousands of documents
batch_size = 32
embeddings = model.encode(
    documents,
    batch_size=batch_size,
    show_progress_bar=True,
    normalize_embeddings=True  # normalize for cosine similarity
)
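encode() already splits its input into batches of batch_size internally. For corpora too large to hold every vector in memory at once, one common pattern is an outer loop that encodes a chunk, writes it out, and moves on. A minimal sketch; chunked is a plain helper, and write_to_vector_db is a hypothetical placeholder for your database client:

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], chunk_size: int) -> Iterator[Sequence[str]]:
    """Yield successive slices of `items` with at most `chunk_size` elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Hypothetical usage with the model from the example above:
# for chunk in chunked(documents, 10_000):
#     vectors = model.encode(chunk, batch_size=32, normalize_embeddings=True)
#     write_to_vector_db(chunk, vectors)  # hypothetical helper for your database client

docs = [f"doc {i}" for i in range(25)]
print([len(c) for c in chunked(docs, 10)])  # [10, 10, 5]
```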

For GPU acceleration, sentence-transformers automatically uses CUDA if available. A single A10G GPU (available on Hugging Face Spaces or cloud providers for ~$0.60/hour) can encode 100,000-500,000 sentences per hour depending on model and sentence length.
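Using the figures in the previous paragraph, the cost of a one-off GPU encoding job is hours = sentences / throughput, multiplied by the hourly price. A sketch that takes the quoted $0.60/hour and 100,000-500,000 sentences/hour range as assumptions; measure your own model and sequence lengths before relying on the numbers:

```python
GPU_PRICE_PER_HOUR = 0.60               # USD, A10G price quoted above
THROUGHPUT_RANGE = (100_000, 500_000)   # sentences/hour range quoted above


def gpu_job_estimate(num_sentences: int, sentences_per_hour: float,
                     price_per_hour: float = GPU_PRICE_PER_HOUR) -> tuple[float, float]:
    """Return (GPU hours, cost in USD) for encoding `num_sentences`."""
    hours = num_sentences / sentences_per_hour
    return hours, hours * price_per_hour


for rate in THROUGHPUT_RANGE:
    hours, cost = gpu_job_estimate(10_000_000, rate)
    print(f"{rate:,} sentences/hour: {hours:.0f} h, ${cost:.2f}")
```

For 10 million sentences this gives 20 to 100 GPU-hours, or roughly $12 to $60 at the quoted rate.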

Keep Reading

Hugging Face Complete Guide - Where to find and download embedding models
LangChain vs LlamaIndex Comparison - Frameworks that use embedding models for RAG
Open Source RAG Stack Guide - Building a complete retrieval pipeline

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace - chat, projects, time tracking, AI meeting summaries, and invoicing - in one tool. Try it free.

Frequently Asked Questions

What are open source embedding models?

Open source embedding models are pre-trained neural network models that convert text into numerical vectors (embeddings) representing semantic meaning. They are freely available under open source licenses (e.g., Apache 2.0, MIT) and can be run locally or on your own infrastructure without API fees. Popular examples include all-MiniLM-L6-v2, BAAI/bge-m3, and Nomic Embed Text v1.5.

How do open source embedding models work?

These models use transformer architectures (like BERT) to process text and output a fixed-size vector. The model is trained on large text corpora to map semantically similar sentences to nearby points in the vector space. At inference, you pass a string through the model, and it returns an array of floats. The vectors can be compared using cosine similarity to find similar texts. Libraries like sentence-transformers simplify this process.

What are the best practices for using open source embedding models?

Best practices include: (1) Normalize embeddings to unit length for cosine similarity. (2) Use batch processing for efficiency. (3) Choose a model dimension that balances quality and storage (384-1024). (4) For multilingual use, pick BGE-M3 or similar. (5) Test on your specific domain data, as MTEB scores may not reflect domain-specific performance. (6) Consider using a vector database for scalable similarity search.

How much does it cost to use open source embedding models?

The models themselves are free. The cost is infrastructure: CPU/GPU compute and memory. Running all-MiniLM-L6-v2 on a CPU you already own costs little beyond electricity. For large-scale production, a GPU like an A10G (~$0.60/hour) can process 100k-500k sentences per hour. Compare that to OpenAI's text-embedding-3-small at $0.02 per 1M tokens: at high enough volume, open source can be cheaper.

Is using open source embedding models worth it in 2026?

Yes, for most RAG and semantic search use cases. Models like BGE-M3 and Nomic Embed match OpenAI's quality on MTEB benchmarks. Open source is worth it if: you have high volume, need multilingual support, require data privacy, or want to avoid API dependency. It may not be worth it if you have low volume and prefer zero infrastructure management.
