Open Source Embedding Models: Which One to Use in 2026

sentence-transformers, BGE-M3, and Nomic Embed are your main options. Here is how they compare to OpenAI's embeddings and when open source is good enough.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

8 min read

// tags

#embedding-models #sentence-transformers #rag #semantic-search

Embedding models convert text into numerical vectors that capture semantic meaning, enabling semantic search, similarity ranking, and retrieval-augmented generation. For most RAG and semantic search use cases, open source embedding models now match or come close to OpenAI's text-embedding-3-small model on standard benchmarks, at zero API cost when run locally. The best open source embedding models for general English text: all-MiniLM-L6-v2 for speed-sensitive applications, all-mpnet-base-v2 for higher quality at moderate latency, and BAAI/bge-m3 for state-of-the-art quality and multilingual support. The right choice depends on your latency requirements, language needs, and whether you are running locally or via API.

Here is the complete comparison.

How Embedding Models Work

An embedding model takes a string of text as input and returns a vector (an array of floating-point numbers) as output. The vector's dimensions encode semantic information: two texts with similar meaning will produce vectors that are close in the vector space (high cosine similarity). Two texts with different meanings will produce vectors that are far apart.
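
To make "close in the vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made stand-ins for real embeddings (an assumption for illustration; real models produce hundreds of dimensions), and the names cat, kitten, and invoice are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy vectors, NOT output from any real embedding model.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

sim_close = cosine_similarity(cat, kitten)   # similar meaning
sim_far = cosine_similarity(cat, invoice)    # unrelated meaning

print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

The score for the related pair comes out much higher than for the unrelated pair, which is the property retrieval relies on.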

Vector dimensionality varies by model: all-MiniLM-L6-v2 produces 384-dimensional vectors, all-mpnet-base-v2 produces 768-dimensional, text-embedding-3-small produces 1536-dimensional, and BGE-M3 produces 1024-dimensional. Higher dimensionality does not always mean better quality, but it does mean more storage and slower similarity search.
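
The storage side of that trade-off is simple arithmetic. The sketch below estimates raw index size for the dimensions listed above, assuming vectors are stored as 4-byte float32 values and ignoring any index overhead; the one-million-document corpus is a hypothetical figure.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

# Dimensions as stated in the paragraph above.
DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "BGE-M3": 1024,
    "text-embedding-3-small": 1536,
}

def index_size_bytes(num_vectors, dims):
    """Raw size of num_vectors float32 vectors, without index overhead."""
    return num_vectors * dims * BYTES_PER_FLOAT32

NUM_DOCS = 1_000_000  # hypothetical corpus size

for name, dims in DIMENSIONS.items():
    gib = index_size_bytes(NUM_DOCS, dims) / 1024**3
    print(f"{name:24s} {dims:5d} dims  {gib:.2f} GiB")
```

Raw storage grows linearly with dimension, so a 1536-dimensional index takes four times the space of a 384-dimensional one for the same corpus.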

The primary use case in AI applications: you embed your documents and store the vectors in a vector database (Chroma, Pinecone, Weaviate, pgvector). At query time, you embed the user's query and find the most similar document vectors. Those documents are the context you pass to the LLM. This is the core of RAG.
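
The retrieval step can be sketched without a vector database at all. This is a brute-force, in-memory version under stated assumptions: the vectors are made-up toy values rather than model output, and the document IDs and the top_k helper are hypothetical names. A real vector database replaces the loop with an index, but the ranking idea is the same.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, doc_vecs, k=2):
    """Return (doc_id, score) pairs for the k most similar documents."""
    q = normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = normalize(vec)
        # Dot product of unit vectors equals cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "index": made-up vectors standing in for embedded documents.
index = {
    "refund-policy": [0.9, 0.1, 0.1],
    "shipping-times": [0.2, 0.9, 0.1],
    "api-reference": [0.1, 0.1, 0.9],
}

# Stands in for the embedding of a query such as "how do I get my money back?"
query = [0.8, 0.3, 0.0]

results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The text of the returned documents is what would be passed to the LLM as context.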

The sentence-transformers Library

sentence-transformers (GitHub: UKPLab/sentence-transformers, 16k+ stars) is the standard Python library for running open source embedding models locally. It wraps Hugging Face Transformers with a simpler API for generating sentence embeddings.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["This is an example sentence", "Each sentence is converted"]
embeddings = model.encode(sentences)
print(embeddings.shape)  # (2, 384)

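The library handles tokenization, batching, and pooling automatically. Some models also include a normalization layer; passing normalize_embeddings=True to encode guarantees unit-length vectors regardless of the model. For most use cases, this is all you need.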

Model Comparison

all-MiniLM-L6-v2

Dimensions: 384
Speed: Very fast (6-layer transformer, small)
Quality: Good for general semantic similarity
Use case: Speed-sensitive applications, large batch processing, low-memory environments
MTEB score: ~56 (MTEB is the standard embedding benchmark)
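Inference speed: ~14,000 sentences/second in the sentence-transformers documentation benchmark, which was measured on a V100 GPU; expect substantially lower throughput on CPU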
Memory footprint: ~90MB model size

all-mpnet-base-v2

Dimensions: 768
Speed: Moderate (12-layer transformer)
Quality: Better than MiniLM for most tasks
Use case: When you need higher quality and can afford slightly more latency
MTEB score: ~57.8
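Inference speed: ~2,800 sentences/second in the sentence-transformers documentation benchmark (V100 GPU); expect substantially lower throughput on CPU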
Memory footprint: ~420MB model size

BAAI/bge-m3

Dimensions: 1024
Speed: Slower (large model)
Quality: State of the art across 100+ languages
Use case: Multilingual applications, production RAG where quality matters most
MTEB score: ~62.6 (as of early 2026)
Supports dense retrieval, sparse retrieval, and multi-vector retrieval from a single model
Memory footprint: ~2.3GB model size

Nomic Embed Text v1.5

Dimensions: 768
Speed: Moderate
Quality: Competitive with bge-large while being smaller
Use case: Good balance of quality and speed for English applications
MTEB score: ~62.4
Fully open source (Apache 2.0) with released training data

Comparison with OpenAI text-embedding-3-small

OpenAI's text-embedding-3-small:

Dimensions: 1536 by default (can be shortened through the API's dimensions parameter, for example to 256, 512, or 1024)
Speed: Fast (API, no local compute)
Quality: MTEB score ~62.3
Cost: $0.02 per 1M tokens
No local deployment required
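
Shortening an embedding amounts to keeping the leading components and rescaling the result to unit length. The sketch below shows that operation on a toy vector. It assumes the model was trained so that the leading dimensions carry the most information, which is what makes shortening viable; applying it to an arbitrary model may hurt quality. The helper name truncate_and_renormalize is hypothetical.

```python
import math

def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in head]

# Toy unit-length vector standing in for a full 1536-dimensional embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_and_renormalize(full, 2)

print(len(short), short)
```

Renormalizing matters because a truncated vector is no longer unit length, so dot-product scores would otherwise stop matching cosine similarity.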

For English-only RAG applications:

bge-m3 and nomic-embed-text-v1.5 are approximately comparable to text-embedding-3-small on MTEB
Local models have zero marginal cost but require GPU or CPU compute
API models have per-call cost but no infrastructure management

When open source embeddings are good enough:

Any application where MTEB ~60+ is sufficient (most RAG applications)
Multilingual applications (bge-m3 outperforms text-embedding-3-small for non-English text)
High-volume applications where per-call API cost is significant
Privacy-sensitive applications where sending data to external APIs is not acceptable

When OpenAI embeddings are worth it:

You want zero infrastructure management
Your embedding volume is low: at $0.02 per 1M tokens, 10M tokens/month costs about $0.20, and the bill stays under $200 until roughly 10B tokens/month
You need the latest embedding model without evaluating and deploying open source alternatives
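
To see where API cost starts to matter, it helps to run the numbers. This is a small sketch using only the $0.02 per 1M tokens price quoted above; the monthly volumes are hypothetical examples and the function name is made up for illustration.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small price quoted above

def monthly_api_cost(tokens_per_month):
    """Embedding API cost in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * PRICE_PER_MILLION_TOKENS

# Hypothetical monthly volumes: 10M, 1B, and 10B tokens.
for tokens in (10_000_000, 1_000_000_000, 10_000_000_000):
    print(f"{tokens:>14,} tokens/month -> ${monthly_api_cost(tokens):,.2f}")
```

At this price the API bill only becomes a meaningful line item at billions of tokens per month, so for smaller workloads the decision usually turns on privacy, latency, or language coverage rather than cost.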

Running Embeddings at Scale

For production embedding workloads, batch encoding is critical for throughput:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")

# Batch encode for efficiency
documents = ["doc1 text...", "doc2 text..."]  # in practice, thousands of documents
batch_size = 32
embeddings = model.encode(
    documents,
    batch_size=batch_size,
    show_progress_bar=True,
    normalize_embeddings=True  # normalize for cosine similarity
)
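
model.encode batches internally, but it returns every embedding at once. For corpora too large to hold in memory, one option is to feed the model in larger chunks and write each chunk out before moving on. The sketch below shows that pattern; fake_encode is a hypothetical stand-in for the real encoder so the example runs without downloading a model, and encode_in_chunks is not part of sentence-transformers.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

def encode_in_chunks(documents, encode_fn, chunk_size):
    """Yield (document, vector) pairs, encoding one chunk at a time."""
    for chunk in chunks(documents, chunk_size):
        vectors = encode_fn(chunk)
        yield from zip(chunk, vectors)

def fake_encode(texts):
    """Hypothetical stand-in for a real encoder: one number per text."""
    return [[float(len(t))] for t in texts]

docs = [f"document {i}" for i in range(5)]
pairs = list(encode_in_chunks(docs, fake_encode, chunk_size=2))

for text, vector in pairs:
    print(text, vector)
```

With a real model, encode_fn could be something like lambda batch: model.encode(batch, batch_size=32, normalize_embeddings=True), and each yielded pair would be written to the vector database instead of collected in a list.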

For GPU acceleration, sentence-transformers automatically uses CUDA if available. A single A10G GPU (available on Hugging Face Spaces or cloud providers for ~$0.60/hour) can encode 100,000-500,000 sentences per hour depending on model and sentence length.
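
Those figures translate into a rough time and cost estimate for a one-off indexing job. The sketch below uses only the hourly price and throughput range quoted above; the one-million-sentence corpus is a hypothetical example, and real throughput depends on model and sentence length.

```python
GPU_PRICE_PER_HOUR = 0.60               # USD, A10G figure quoted above
THROUGHPUT_RANGE = (100_000, 500_000)   # sentences per hour, range quoted above

def encoding_estimate(num_sentences, sentences_per_hour,
                      price_per_hour=GPU_PRICE_PER_HOUR):
    """Return (hours, cost_usd) to encode num_sentences at a given rate."""
    hours = num_sentences / sentences_per_hour
    return hours, hours * price_per_hour

CORPUS_SIZE = 1_000_000  # hypothetical corpus

for rate in THROUGHPUT_RANGE:
    hours, cost = encoding_estimate(CORPUS_SIZE, rate)
    print(f"{rate:>7,} sentences/hour -> {hours:.1f} h, ${cost:.2f}")
```

Comparing this against the API price requires the average number of tokens per sentence in your data, so measure that on a sample before drawing conclusions.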

Keep Reading

Hugging Face Complete Guide — Where to find and download embedding models
LangChain vs LlamaIndex Comparison — Frameworks that use embedding models for RAG
Open Source RAG Stack Guide — Building a complete retrieval pipeline
