Embeddings are dense numerical vectors (arrays of floating-point numbers) that represent text in a way that captures semantic meaning. Two pieces of text with similar meaning will have similar embedding vectors, even if they use completely different words. This property enables similarity search, clustering, and retrieval-augmented generation without running an LLM over every document at query time.
Why Embeddings Matter
Consider a semantic search problem: you have 100,000 support tickets and want to find the ones most similar to a new incoming ticket. With keyword search, you miss synonyms, paraphrases, and conceptually related tickets that use different words. With embeddings, you convert each ticket to a vector once at index time, convert the query to a vector at search time, and find the most similar vectors using cosine similarity. This handles synonyms and paraphrases naturally.
The key economic advantage: you pay for embedding generation once (at index time), and similarity search over millions of vectors is extremely fast and cheap (often milliseconds). Compare this to running GPT-4o inference on every document at query time, which would be prohibitively expensive.
How to Generate Embeddings
OpenAI text-embedding-3-small
The most practical option for most teams:
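import OpenAI from "openai";

// Reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "The quick brown fox jumps over the lazy dog",
});
const embedding = response.data[0].embedding;
// embedding is a number[] with 1536 dimensions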
Pricing: $0.02 per 1M tokens (extremely cheap). For most applications, the cost of embedding generation is negligible compared to LLM inference costs.
The larger text-embedding-3-large model produces better embeddings (3072 dimensions) at $0.13 per 1M tokens. For most use cases, the small model is sufficient.
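To make the pricing concrete, here is a back-of-the-envelope estimate for the 100,000-ticket example from earlier. The average of 200 tokens per ticket is an assumption for illustration, and `estimateEmbeddingCost` is a hypothetical helper, not part of any SDK.

```typescript
// Prices per 1M tokens as quoted in this article; check the provider's
// pricing page before relying on them.
const PRICE_PER_MILLION_TOKENS = {
  "text-embedding-3-small": 0.02,
  "text-embedding-3-large": 0.13,
} as const;

type EmbeddingModel = keyof typeof PRICE_PER_MILLION_TOKENS;

// Estimated one-time cost in USD to embed a whole corpus.
function estimateEmbeddingCost(
  model: EmbeddingModel,
  documentCount: number,
  avgTokensPerDocument: number, // assumption: measured or guessed by you
): number {
  const totalTokens = documentCount * avgTokensPerDocument;
  return (totalTokens / 1_000_000) * PRICE_PER_MILLION_TOKENS[model];
}

// 100,000 tickets at a hypothetical 200 tokens each = 20M tokens.
const smallModelCost = estimateEmbeddingCost("text-embedding-3-small", 100_000, 200);
const largeModelCost = estimateEmbeddingCost("text-embedding-3-large", 100_000, 200);
```

Under these assumptions the whole corpus costs about $0.40 to embed with the small model and about $2.60 with the large one.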
Cohere Embed
Cohere's embed-english-v3.0 model produces strong embeddings optimized for search and retrieval. It is priced at $0.10 per 1M tokens. Cohere also offers multilingual embeddings (embed-multilingual-v3.0), which perform well across many languages.
Open Source: sentence-transformers and BGE
For self-hosted embedding generation:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(["Text to embed", "Another text"])
BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence (BAAI) are among the strongest open source embedding models, with strong scores on the MTEB benchmark. BGE-large-en-v1.5 is particularly strong for English retrieval tasks.
Running sentence-transformers locally has no per-token fee (you pay only for your own compute) and keeps data on your infrastructure.
Embedding Dimensions
Different models produce vectors of different sizes:
- OpenAI text-embedding-3-small: 1536 dimensions by default (can be shortened with the `dimensions` request parameter)
- OpenAI text-embedding-3-large: 3072 dimensions by default (can also be shortened with `dimensions`)
- BGE-large: 1024 dimensions
- Cohere embed-english-v3.0: 1024 dimensions
More dimensions generally mean more expressive embeddings, but also more storage and slower similarity computation. For most applications, 768-1536 dimensions is a reasonable range.
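The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and ignores index overhead and metadata, which real vector databases add on top; `rawVectorStorageBytes` is a hypothetical helper.

```typescript
const BYTES_PER_FLOAT32 = 4; // assumption: float32 storage, no quantization
const BYTES_PER_GIGABYTE = 1e9;

// Raw bytes needed to hold the vectors alone.
function rawVectorStorageBytes(vectorCount: number, dimensions: number): number {
  return vectorCount * dimensions * BYTES_PER_FLOAT32;
}

// One million vectors at the dimension counts listed above.
const gbAt1024 = rawVectorStorageBytes(1_000_000, 1024) / BYTES_PER_GIGABYTE;
const gbAt1536 = rawVectorStorageBytes(1_000_000, 1536) / BYTES_PER_GIGABYTE;
const gbAt3072 = rawVectorStorageBytes(1_000_000, 3072) / BYTES_PER_GIGABYTE;
```

For one million vectors this works out to roughly 4.1 GB at 1024 dimensions, 6.1 GB at 1536, and 12.3 GB at 3072, so doubling the dimension count doubles raw storage.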
Similarity Metrics
Three common ways to measure similarity between embedding vectors:
Cosine similarity: measures the angle between vectors. Ranges from -1 to 1, where 1 means identical direction. Best for most NLP tasks because it is invariant to vector magnitude.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  return dot / (magA * magB);
}
Dot product: for normalized vectors (unit length), identical to cosine similarity. Faster to compute. OpenAI embeddings are normalized, so dot product works well.
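A minimal sketch of why normalization matters: once vectors are scaled to unit length, a plain dot product gives the cosine similarity directly. The two-dimensional vectors are hypothetical stand-ins for real embeddings, and `normalize` and `dotProduct` are illustrative helpers.

```typescript
// Scale a vector to unit length.
function normalize(vector: number[]): number[] {
  const magnitude = Math.sqrt(vector.reduce((sum, x) => sum + x * x, 0));
  if (magnitude === 0) throw new Error("Cannot normalize a zero vector");
  return vector.map((x) => x / magnitude);
}

// Sum of element-wise products; equals cosine similarity for unit vectors.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Hypothetical 2-dimensional vectors; real embeddings have hundreds of dimensions.
const unitA = normalize([3, 4]); // approximately [0.6, 0.8]
const unitB = normalize([4, 3]); // approximately [0.8, 0.6]
const unitSimilarity = dotProduct(unitA, unitB); // approximately 0.96
```

If your embeddings are not already normalized (some self-hosted models return unnormalized vectors unless asked otherwise), normalizing once at index time lets every later comparison skip the two square roots.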
Euclidean distance: straight-line distance between vectors. Less common for text similarity because it is sensitive to magnitude differences that do not reflect semantic similarity.
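When vectors are unit length, Euclidean distance and cosine similarity are directly related. Expanding the squared distance, with θ denoting the angle between a and b:

```latex
\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,(a \cdot b) = 2 - 2\cos\theta
\qquad \text{when } \|a\| = \|b\| = 1
```

So for normalized embeddings, ranking by smallest Euclidean distance returns the same order as ranking by largest cosine similarity. The magnitude sensitivity only matters for unnormalized vectors.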
Practical Applications
Semantic Search
The most common embedding use case: embed your document corpus once, store the vectors in a vector database (Pinecone, Weaviate, pgvector, Chroma), embed each query at search time, and return the top-k most similar documents.
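The search step can be sketched as a brute-force scan over an in-memory index. This is an illustration only: it assumes all vectors are unit length (so the dot product is the cosine similarity), uses hypothetical two-dimensional vectors and document IDs, and stands in for what a vector database does with an approximate nearest-neighbor index.

```typescript
interface IndexedDocument {
  id: string;
  vector: number[]; // assumption: unit-length embedding
}

interface SearchHit {
  id: string;
  score: number;
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Score every document against the query and keep the k best.
function topK(queryVector: number[], docs: IndexedDocument[], k: number): SearchHit[] {
  return docs
    .map((doc) => ({ id: doc.id, score: dot(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical index of three tickets.
const ticketIndex: IndexedDocument[] = [
  { id: "refund-request", vector: [1, 0] },
  { id: "login-failure", vector: [0, 1] },
  { id: "billing-dispute", vector: [0.8, 0.6] },
];

const searchHits = topK([0.6, 0.8], ticketIndex, 2);
```

Here the two best matches are `billing-dispute` followed by `login-failure`. A linear scan like this is fine for a few thousand vectors; beyond that, the indexing in a vector database is what keeps queries in the millisecond range.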
Duplicate Detection
Embed all items and find pairs with cosine similarity above a threshold (e.g., 0.95). Works for finding near-duplicate support tickets, product descriptions, or articles even when they are not textually identical.
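A minimal sketch of threshold-based duplicate detection, comparing every pair once. The item IDs and three-dimensional vectors are hypothetical, and the pairwise loop is quadratic in the number of items, so for large collections you would query a vector index per item instead.

```typescript
interface EmbeddedItem {
  id: string;
  vector: number[];
}

function cosineSim(a: number[], b: number[]): number {
  let dotSum = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dotSum += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dotSum / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Return every pair of IDs whose similarity exceeds the threshold.
function findNearDuplicates(items: EmbeddedItem[], threshold = 0.95): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      if (cosineSim(items[i].vector, items[j].vector) > threshold) {
        pairs.push([items[i].id, items[j].id]);
      }
    }
  }
  return pairs;
}

// Hypothetical tickets: the first two point in nearly the same direction.
const duplicatePairs = findNearDuplicates([
  { id: "ticket-1", vector: [1, 0, 0] },
  { id: "ticket-2", vector: [0.98, 0.2, 0] },
  { id: "ticket-3", vector: [0, 0, 1] },
]);
```

Only `ticket-1` and `ticket-2` are reported. The right threshold depends on the model and the content, so it is worth checking a sample of flagged pairs by hand before automating any merge or delete.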
Recommendation Systems
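"Users who liked X also liked Y" can be implemented by embedding items and finding similar items in embedding space. Because content embeddings need no interaction history, this approach handles cold-start items better than collaborative filtering, which has nothing to work with until users have interacted with an item.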
Retrieval-Augmented Generation (RAG)
The primary use of embeddings in LLM applications. Instead of fitting all your knowledge into the context window, you:
- Embed your knowledge base at index time
- At query time, embed the user's question
- Retrieve the top-k most relevant chunks
- Include those chunks in the LLM's context
- Generate a response grounded in retrieved information
This works around the model's knowledge cutoff and keeps LLM inference costs manageable, because only the retrieved chunks enter the context window.
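The retrieval and prompt-assembly steps above can be sketched as follows. The embedding and generation calls are left out so the example runs offline: `questionVector` is a stand-in for the embedded question, the knowledge base and its vectors are hypothetical, and the prompt wording is one possible template rather than a recommended one.

```typescript
interface KnowledgeChunk {
  text: string;
  vector: number[]; // assumption: unit-length embedding computed at index time
}

function dotScore(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Retrieve the k chunks most similar to the question.
function retrieveTopK(queryVector: number[], chunks: KnowledgeChunk[], k: number): KnowledgeChunk[] {
  return chunks
    .slice() // copy so the caller's array is not reordered
    .sort((x, y) => dotScore(queryVector, y.vector) - dotScore(queryVector, x.vector))
    .slice(0, k);
}

// Place the retrieved chunks in the prompt that will be sent to the LLM.
function buildGroundedPrompt(question: string, context: KnowledgeChunk[]): string {
  const contextBlock = context.map((chunk, i) => `[${i + 1}] ${chunk.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${contextBlock}\n\nQuestion: ${question}`;
}

// Hypothetical knowledge base.
const knowledgeBase: KnowledgeChunk[] = [
  { text: "Refunds are processed within 5 business days.", vector: [1, 0] },
  { text: "Passwords can be reset from the login page.", vector: [0, 1] },
  { text: "Refunds go back to the original payment method.", vector: [0.8, 0.6] },
];

const questionVector = [1, 0]; // stand-in for the embedded question
const ragPrompt = buildGroundedPrompt(
  "How long do refunds take?",
  retrieveTopK(questionVector, knowledgeBase, 2),
);
```

With `k = 2`, both refund chunks make it into the prompt and the password chunk is left out, which is the behavior that keeps the context window small.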
MTEB Benchmark: Where Different Models Rank
The Massive Text Embedding Benchmark (MTEB) is the standard evaluation for embedding models. Its English benchmark covers 56 datasets across 8 task categories. The Hugging Face MTEB Leaderboard (huggingface.co/spaces/mteb/leaderboard) maintains current rankings.
As of late 2024, widely used models with strong MTEB scores include:
- Commercial: OpenAI text-embedding-3-large, Cohere embed-english-v3.0
- Open source: BAAI/bge-large-en-v1.5, E5-large-v2, GTE-large
For production use cases, check MTEB for the specific task category that matches your use case (retrieval, clustering, classification, etc.) rather than looking only at the overall score.
Performance Considerations
Batch your embedding requests. Instead of embedding one document at a time:
// Slow: one request per document
for (const doc of documents) {
  await openai.embeddings.create({ model: "text-embedding-3-small", input: doc });
}

// Fast: batch up to 2048 inputs per request
const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: documents, // array of strings
});
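When a corpus has more documents than fit in one request, it needs to be split first. A minimal sketch, assuming the 2048-input limit quoted above; `toBatches` is a hypothetical helper, and any per-request token cap the provider enforces is not handled here.

```typescript
const MAX_INPUTS_PER_REQUEST = 2048; // limit quoted above

// Split a list into consecutive batches of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number = MAX_INPUTS_PER_REQUEST): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be at least 1");
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

// A hypothetical corpus of 5,000 documents becomes three requests.
const corpus = Array.from({ length: 5000 }, (_, i) => `document ${i}`);
const requestBatches = toBatches(corpus);
const batchSizes = requestBatches.map((batch) => batch.length); // [2048, 2048, 904]
```

Each batch can then be passed as the `input` array of a single embeddings request, turning 5,000 round trips into three.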
Store embeddings, do not recompute them. Embeddings for static content should be computed once and stored in a vector database. Recomputing on every request wastes money and adds latency.
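One way to avoid recomputation is to check a store before calling the embedding API and send only the texts that are missing. The sketch below uses an in-memory `Map` keyed by the text itself as a stand-in for a vector database. `findUnembedded` is a hypothetical helper, and a production system would more likely key on a content hash or document ID.

```typescript
type EmbeddingStore = Map<string, number[]>;

// Return the distinct texts that have no stored embedding yet.
function findUnembedded(texts: string[], store: EmbeddingStore): string[] {
  const missing = new Set<string>();
  for (const text of texts) {
    if (!store.has(text)) missing.add(text);
  }
  return Array.from(missing);
}

// Hypothetical store that already holds one embedding.
const embeddingStore: EmbeddingStore = new Map<string, number[]>([
  ["How do I reset my password?", [0.1, 0.9]],
]);

const textsToEmbed = findUnembedded(
  [
    "How do I reset my password?", // already stored, skipped
    "Where is my refund?",
    "Where is my refund?", // duplicate, embedded only once
  ],
  embeddingStore,
);
```

Only `"Where is my refund?"` is left to embed, and it appears once even though it was submitted twice.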