LLM Embeddings Explained: What They Are and How to Use Them

Embeddings convert text into dense numerical vectors that capture semantic meaning, enabling similarity search and retrieval at scale without running inference on every query.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

Embeddings are dense numerical vectors (arrays of floating-point numbers) that represent text in a way that captures semantic meaning. Two pieces of text with similar meaning will have similar embedding vectors, even if they use completely different words. This property enables similarity search, clustering, and retrieval-augmented generation without running expensive LLM inference at query time.

Why Embeddings Matter

Consider a semantic search problem: you have 100,000 support tickets and want to find the ones most similar to a new incoming ticket. With keyword search, you miss synonyms, paraphrases, and conceptually related tickets that use different words. With embeddings, you convert each ticket to a vector once at index time, convert the query to a vector at search time, and find the most similar vectors using cosine similarity. This handles synonyms and paraphrases naturally.

The key economic advantage: you pay for embedding generation once (at index time), and similarity search over millions of vectors is extremely fast and cheap (often milliseconds). Compare this to running GPT-4o inference on every document at query time, which would be prohibitively expensive.

How to Generate Embeddings

OpenAI text-embedding-3-small

The most practical option for most teams:

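import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable
const openai = new OpenAI();

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "The quick brown fox jumps over the lazy dog",
});

const embedding = response.data[0].embedding;
// embedding is a number[] with 1536 dimensions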

Pricing: $0.02 per 1M tokens (extremely cheap). For most applications, the cost of embedding generation is negligible compared to LLM inference costs.

The larger text-embedding-3-large model produces better embeddings (3072 dimensions) at $0.13 per 1M tokens. For most use cases, the small model is sufficient.
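
To put those prices in perspective, here is a rough back-of-the-envelope sketch. The function name and the 500-token average ticket length are assumptions made for illustration; only the per-million-token prices come from the text above.

```typescript
// Estimate the one-time cost (in US dollars) of embedding a corpus.
// The price per 1M tokens comes from the figures quoted above;
// the document count and average token length are illustrative assumptions.
function estimateEmbeddingCostUsd(
  documentCount: number,
  avgTokensPerDocument: number,
  pricePerMillionTokens: number,
): number {
  const totalTokens = documentCount * avgTokensPerDocument;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 100,000 tickets at an assumed 500 tokens each is 50M tokens in total.
const smallModelCost = estimateEmbeddingCostUsd(100_000, 500, 0.02);
const largeModelCost = estimateEmbeddingCostUsd(100_000, 500, 0.13);

console.log(`small: $${smallModelCost.toFixed(2)}, large: $${largeModelCost.toFixed(2)}`);
```

Under these assumptions the whole corpus costs about a dollar with the small model and a few dollars with the large one, which is why embedding cost rarely drives the decision.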

Cohere Embed

Cohere's embed-english-v3.0 model produces strong embeddings optimized for search and retrieval. It is priced at $0.10 per 1M tokens. Cohere also offers a multilingual model (embed-multilingual-v3.0), which performs well across many languages.

Open Source: sentence-transformers and BGE

For self-hosted embedding generation:

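from sentence_transformers import SentenceTransformer

# The model weights are downloaded from Hugging Face on first use
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
embeddings = model.encode(["Text to embed", "Another text"])
# embeddings is a NumPy array of shape (2, 1024)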

BGE (BAAI General Embedding) models from the Beijing Academy of Artificial Intelligence (BAAI) are among the strongest open source embedding models, with strong scores on the MTEB benchmark. BGE-large-en-v1.5 is particularly strong for English retrieval tasks.

Running sentence-transformers locally has no per-token fee (you pay for your own compute instead) and keeps data on your infrastructure.

Embedding Dimensions

Different models produce vectors of different sizes:

OpenAI text-embedding-3-small: 1536 dimensions by default (can be shortened with the API's dimensions parameter)
OpenAI text-embedding-3-large: 3072 dimensions by default (can also be shortened)
BGE-large: 1024 dimensions
Cohere embed-english-v3.0: 1024 dimensions
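
If you already have full-size vectors stored, one way to shorten them after the fact is to keep the leading values and rescale to unit length. The sketch below is a minimal illustration; `shortenEmbedding` is a hypothetical helper name, and the approach only makes sense for models designed to support shortened vectors, which the text-embedding-3 family is. Treat it as an assumption for any other model.

```typescript
// Keep the first `dims` values of an embedding, then rescale the result
// to unit length so dot product and cosine similarity stay comparable.
function shortenEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims < 1 || dims > embedding.length) {
    throw new RangeError(`dims must be an integer between 1 and ${embedding.length}`);
  }
  const truncated = embedding.slice(0, dims);
  const norm = Math.sqrt(truncated.reduce((sum, v) => sum + v * v, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return truncated.map((v) => v / norm);
}

// Toy unit-length vector standing in for a real embedding.
const fullVector = [0.5, 0.5, 0.5, 0.5];
const shortVector = shortenEmbedding(fullVector, 2);
console.log(shortVector.length);
```

When you control the request, asking the API for the smaller size directly is simpler than post-processing.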

More dimensions generally means more expressive embeddings, but also more storage and slower similarity computation. For most applications, 768-1536 dimensions is the right range.
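
The storage side of that trade-off is simple arithmetic. This sketch assumes vectors are stored as 32-bit floats (4 bytes per value) and ignores any index overhead a vector database adds; both are assumptions, not figures from a specific product.

```typescript
// Raw storage for float32 vectors: vector count x dimensions x 4 bytes.
// Index structures in a real vector database add overhead on top of this.
function rawVectorStorageBytes(vectorCount: number, dimensions: number): number {
  const BYTES_PER_FLOAT32 = 4;
  return vectorCount * dimensions * BYTES_PER_FLOAT32;
}

const vectorCount = 1_000_000;
for (const dims of [1024, 1536, 3072]) {
  const gigabytes = rawVectorStorageBytes(vectorCount, dims) / 1e9;
  console.log(`${dims} dimensions: ${gigabytes.toFixed(2)} GB per million vectors`);
}
```

Doubling the dimension count doubles both the storage and the arithmetic per comparison, so the larger model needs to earn its place with measurably better retrieval.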

Similarity Metrics

Three common ways to measure similarity between embedding vectors:

Cosine similarity: measures the angle between vectors. Ranges from -1 to 1, where 1 means identical direction. Best for most NLP tasks because it is invariant to vector magnitude.

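function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const magB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  // Note: the result is NaN if either vector is all zeros
  return dot / (magA * magB);
}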

Dot product: for normalized vectors (unit length), identical to cosine similarity. It is faster to compute because it skips the two magnitude calculations. OpenAI embeddings are normalized, so dot product works well.
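
A small sketch of why the two agree: once each vector is scaled to unit length, the denominator in the cosine formula is 1, so only the dot product remains. The helper names `l2Normalize` and `dotProduct` are illustrative choices, not part of any library.

```typescript
// Scale a vector to unit length (L2 norm of 1).
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) {
    throw new Error("cannot normalize a zero vector");
  }
  return v.map((x) => x / norm);
}

// Plain dot product; equals cosine similarity when both inputs are unit length.
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}

const unitA = l2Normalize([3, 4]); // [0.6, 0.8]
const unitB = l2Normalize([4, 3]); // [0.8, 0.6]
const similarity = dotProduct(unitA, unitB); // cosine of the original vectors: 24 / 25
console.log(similarity.toFixed(2));
```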

Euclidean distance: straight-line distance between vectors. Less common for text similarity because it is sensitive to magnitude differences that do not reflect semantic similarity.
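
The magnitude sensitivity is easy to see with toy vectors. In this sketch (vector values chosen purely for illustration), a vector pointing in exactly the same direction ends up "farther away" than a perpendicular one, simply because it is longer.

```typescript
// Straight-line (L2) distance between two vectors of equal length.
function euclideanDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let sumOfSquares = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sumOfSquares += diff * diff;
  }
  return Math.sqrt(sumOfSquares);
}

const base = [1, 0];
const sameDirectionLonger = [10, 0]; // cosine similarity with base is 1
const perpendicular = [0, 1];        // cosine similarity with base is 0

const distanceToLonger = euclideanDistance(base, sameDirectionLonger);  // 9
const distanceToPerpendicular = euclideanDistance(base, perpendicular); // sqrt(2)
console.log(distanceToLonger > distanceToPerpendicular);
```

If all vectors are normalized to unit length first, this effect disappears and Euclidean distance ranks results in the same order as cosine similarity.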

Practical Applications

Semantic Search

The most common embedding use case. Embed your document corpus once, store the vectors in a vector database (Pinecone, Weaviate, pgvector, Chroma), embed each query at search time, and return the top-k most similar documents.
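
The retrieval step can be sketched as a brute-force scan over an in-memory array. This is an illustration only: the `IndexedDoc` shape, the `topKSimilar` name, and the three-dimensional toy vectors are all assumptions, and a real vector database would typically use an approximate nearest-neighbor index instead of comparing against every document.

```typescript
interface IndexedDoc {
  id: string;
  embedding: number[];
}

interface SearchHit {
  id: string;
  score: number;
}

// Score every document against the query and return the k best matches.
function topKSimilar(query: number[], docs: IndexedDoc[], k: number): SearchHit[] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0;
    let magA = 0;
    let magB = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      magA += a[i] * a[i];
      magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
  };

  return docs
    .map((doc) => ({ id: doc.id, score: cosine(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, k);
}

// Toy corpus with made-up 3-dimensional embeddings.
const corpus: IndexedDoc[] = [
  { id: "refund-policy", embedding: [0.9, 0.1, 0.0] },
  { id: "login-issue", embedding: [0.1, 0.9, 0.1] },
  { id: "billing-error", embedding: [0.8, 0.2, 0.1] },
];

const hits = topKSimilar([1, 0, 0], corpus, 2);
console.log(hits.map((hit) => hit.id));
```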

Duplicate Detection

Embed all items and find pairs with cosine similarity above a threshold (e.g., 0.95). Works for finding near-duplicate support tickets, product descriptions, or articles even when they are not textually identical.
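
A minimal sketch of the pairwise check, assuming the embeddings are already computed. The names and toy vectors are illustrative, and the nested loop is quadratic in the number of items, so for a large corpus you would more likely query a vector index for each item's nearest neighbors.

```typescript
interface EmbeddedItem {
  id: string;
  embedding: number[];
}

interface DuplicatePair {
  first: string;
  second: string;
  score: number;
}

// Compare every pair once and keep those whose cosine similarity exceeds the threshold.
function findNearDuplicates(items: EmbeddedItem[], threshold: number): DuplicatePair[] {
  const cosine = (a: number[], b: number[]): number => {
    let dot = 0;
    let magA = 0;
    let magB = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      magA += a[i] * a[i];
      magB += b[i] * b[i];
    }
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
  };

  const pairs: DuplicatePair[] = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      const score = cosine(items[i].embedding, items[j].embedding);
      if (score > threshold) {
        pairs.push({ first: items[i].id, second: items[j].id, score });
      }
    }
  }
  return pairs;
}

// Toy 2-dimensional embeddings: t1 and t2 point almost the same way.
const tickets: EmbeddedItem[] = [
  { id: "t1", embedding: [1, 0] },
  { id: "t2", embedding: [0.99, 0.05] },
  { id: "t3", embedding: [0, 1] },
];

const duplicates = findNearDuplicates(tickets, 0.95);
console.log(duplicates.map((pair) => `${pair.first} ~ ${pair.second}`));
```

The right threshold depends on the model and the data, so it is worth checking a sample of flagged pairs by hand before trusting any particular cutoff.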

Recommendation Systems

"Users who liked X also liked Y" can be implemented by embedding items and finding similar items in embedding space. This handles cold-start problems better than collaborative filtering, because a new item can be embedded from its content before it has any interaction data.

Retrieval-Augmented Generation (RAG)

The primary use of embeddings in LLM applications. Instead of fitting all your knowledge into the context window, you:

Embed your knowledge base at index time
At query time, embed the user's question
Retrieve the top-k most relevant chunks
Include those chunks in the LLM's context
Generate a response grounded in retrieved information

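This works around the knowledge cutoff problem, because the model can answer from documents newer than its training data, and it keeps LLM inference costs manageable, because only the relevant chunks go into the prompt.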
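
The last two retrieval steps (pick the top-k chunks, put them in the context) can be sketched as a pure function. Everything here is hypothetical: the `ScoredChunk` shape, the `buildGroundedPrompt` name, the prompt wording, and the sample scores. Embedding the question and calling the LLM are deliberately left out.

```typescript
interface ScoredChunk {
  text: string;
  score: number; // similarity to the user's question, computed elsewhere
}

// Select the k highest-scoring chunks and assemble them into a prompt.
function buildGroundedPrompt(question: string, chunks: ScoredChunk[], k: number): string {
  const context = [...chunks] // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((chunk, index) => `[${index + 1}] ${chunk.text}`)
    .join("\n");

  return [
    "Answer the question using only the context below.",
    "",
    "Context:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

const ragPrompt = buildGroundedPrompt(
  "How long do refunds take?",
  [
    { text: "Password resets expire after 24 hours.", score: 0.21 },
    { text: "Refunds are processed within 5 business days.", score: 0.91 },
    { text: "Refund requests require an order number.", score: 0.78 },
  ],
  2,
);

console.log(ragPrompt);
```

Numbering the chunks makes it easier to ask the model to cite which passage supports each claim.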

MTEB Benchmark: Where Different Models Rank

The Massive Text Embedding Benchmark (MTEB) is the standard evaluation for embedding models. The original benchmark defines 8 task categories, and its English leaderboard covers 56 datasets. The Hugging Face MTEB Leaderboard (huggingface.co/spaces/mteb/leaderboard) maintains current rankings.

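As of late 2024, well-known models with strong MTEB results include the following (rankings shift frequently, so check the leaderboard for the current order):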

Commercial: OpenAI text-embedding-3-large, Cohere embed-english-v3.0
Open source: BAAI/bge-large-en-v1.5, E5-large-v2, GTE-large

For production use cases, check MTEB for the specific task category that matches your use case (retrieval, clustering, classification, etc.) rather than looking only at the overall score.

Performance Considerations

Batch your embedding requests. Instead of embedding one document at a time:

// Assumes an initialized openai client and documents: string[]
const model = "text-embedding-3-small";

// Slow: one request per document
for (const doc of documents) {
  await openai.embeddings.create({ model, input: doc });
}

// Fast: batch up to 2048 inputs per request
const response = await openai.embeddings.create({
  model,
  input: documents, // array of strings
});
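
When the corpus is larger than one request allows, it has to be split first. The sketch below shows only the splitting; `toBatches` is a hypothetical helper, and the 2048 figure is the per-request input limit quoted above. Each resulting batch would then be passed as `input` in its own request.

```typescript
// Split an array into consecutive batches of at most `batchSize` items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let start = 0; start < items.length; start += batchSize) {
    batches.push(items.slice(start, start + batchSize));
  }
  return batches;
}

const MAX_INPUTS_PER_REQUEST = 2048; // limit quoted above
const texts = Array.from({ length: 5000 }, (_, i) => `document ${i}`);
const batches = toBatches(texts, MAX_INPUTS_PER_REQUEST);

console.log(batches.map((batch) => batch.length));
```

Providers may also cap the total tokens per request, so very long documents can require smaller batches than the input count alone suggests.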

Store embeddings, do not recompute them. Embeddings for static content should be computed once and stored in a vector database. Recomputing on every request wastes money and adds latency.
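
A minimal sketch of the compute-once idea, using an in-memory Map as a stand-in for a vector database or other persistent store. The class name, the key format, and the fake embedding function are all hypothetical. The model name is part of the key because vectors from different models are not interchangeable.

```typescript
// Caches one value per (model, text) pair. T can be number[] for a
// synchronous embedder or Promise<number[]> for an API call.
class EmbeddingCache<T> {
  private readonly store = new Map<string, T>();

  getOrCompute(model: string, text: string, compute: (text: string) => T): T {
    const key = JSON.stringify([model, text]); // unambiguous composite key
    if (this.store.has(key)) {
      return this.store.get(key) as T;
    }
    const value = compute(text);
    this.store.set(key, value);
    return value;
  }
}

// Fake embedder that counts how often it is called.
let embedCalls = 0;
const fakeEmbed = (text: string): number[] => {
  embedCalls++;
  return [text.length, 0];
};

const cache = new EmbeddingCache<number[]>();
const firstResult = cache.getOrCompute("demo-model", "hello", fakeEmbed);
const secondResult = cache.getOrCompute("demo-model", "hello", fakeEmbed); // served from cache

console.log(embedCalls);
```

Storing the promise rather than the resolved vector also prevents duplicate in-flight requests for the same text, though a production version would want to evict entries whose request failed.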

Keep Reading

Function Calling in LLMs — Another core primitive for LLM applications
How Large Language Models Work: Complete Guide — Understanding the architecture behind embeddings
Cutting LLM API Costs: Complete Guide — How embeddings fit into a cost-optimized LLM stack
