Memory is what separates a capable AI agent from a sophisticated autocomplete. Without memory, every agent session starts from zero. The agent does not know what the user asked yesterday, what tools it already tried, or what decisions were already made. Implementing the right memory architecture for your use case is one of the highest-leverage decisions in agent design.
Why Memory Matters
An agent without memory has a hard ceiling on what tasks it can complete. Anything that requires more than a single context window fails. Anything that spans multiple sessions fails. Anything where the agent needs to learn from past mistakes fails.
The consequences are visible in production: agents that ask the same clarifying questions repeatedly, agents that retry failed approaches without recognizing they already tried them, agents that lose track of a long-form goal halfway through execution. These are memory failures, not intelligence failures.
Short-Term Memory: The Context Window
Short-term memory is the agent's context window. Everything in the current conversation, including the system prompt, user messages, tool call results, and assistant responses, lives in short-term memory. It is fast, always available, and requires no retrieval.
The constraint is context size. As of 2026, many frontier models support context windows of 128k to 200k tokens, and some reach 1M, which is large but not unlimited. A long conversation with many tool calls will eventually overflow the context. When it does, either the application drops early messages (losing history) or the request fails.
Strategies for managing short-term memory:
- Sliding window: keep only the most recent N turns. Simple and deterministic. Loses old context.
- Token budget management: track tokens consumed and compress when approaching the limit. More sophisticated but requires careful implementation.
- Selective retention: tag certain messages as important (key decisions, task goals) and always keep them regardless of age. Discard less important messages when space is tight.
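As a rough illustration, the three strategies can be combined in a few lines. This is a minimal sketch, not a production implementation: the `Message` class, the `pinned` flag, and the 4-characters-per-token estimate in `estimate_tokens` are all assumptions made for the example, and a real agent would count tokens with its model's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    pinned: bool = False  # selective retention: pinned messages are never dropped


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def sliding_window(messages: list[Message], max_turns: int) -> list[Message]:
    """Keep only the most recent max_turns messages."""
    return messages[-max_turns:] if max_turns > 0 else []


def fit_to_budget(messages: list[Message], budget: int) -> list[Message]:
    """Keep every pinned message, then fill the remaining budget newest-first."""
    used = sum(estimate_tokens(m.content) for m in messages if m.pinned)
    keep = {i for i, m in enumerate(messages) if m.pinned}
    for i in range(len(messages) - 1, -1, -1):
        message = messages[i]
        if message.pinned:
            continue
        cost = estimate_tokens(message.content)
        if used + cost > budget:
            break  # stop at the first miss so the retained history stays contiguous
        used += cost
        keep.add(i)
    # Return the survivors in their original order.
    return [m for i, m in enumerate(messages) if i in keep]
```

Pinning the task goal means it survives trimming even when it is the oldest message in the conversation. Instead of dropping the messages that fall outside the budget, an agent could summarize them into a single pinned message, which is the "compress" variant of token budget management.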
Short-term memory is appropriate for: single-session tasks, conversations short enough to fit comfortably in context, workflows where history beyond the last few turns is irrelevant.
Long-Term Memory: Persistence Across Sessions
Long-term memory survives session boundaries. It is stored externally and retrieved when relevant. Two primary approaches exist: vector database storage and key-value storage.
Vector database storage embeds memories as vectors and retrieves them by semantic similarity. When a new conversation begins, the agent embeds the user's first message and queries the vector store for related past memories. This handles unstructured memories well ("the user mentioned they prefer Python over JavaScript") but requires an embedding step on write and a retrieval step on read.
Implementation pattern:
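    import uuid

    import openai
    from qdrant_client import QdrantClient
    from qdrant_client.models import FieldCondition, Filter, MatchValue, PointStruct

    # Assumes a local Qdrant instance with an existing "agent_memories" collection
    # and OPENAI_API_KEY set in the environment.
    client = QdrantClient("localhost", port=6333)
    EMBEDDING_MODEL = "text-embedding-3-small"

    def embed(text: str) -> list[float]:
        """Return the embedding vector for a piece of text."""
        response = openai.embeddings.create(input=text, model=EMBEDDING_MODEL)
        return response.data[0].embedding

    def save_memory(user_id: str, memory_text: str) -> None:
        """Embed a memory and store it alongside the owning user's ID."""
        client.upsert(
            collection_name="agent_memories",
            points=[PointStruct(
                id=str(uuid.uuid4()),  # Qdrant point IDs must be UUIDs or unsigned ints
                vector=embed(memory_text),
                payload={"user_id": user_id, "text": memory_text},
            )],
        )

    def retrieve_memories(user_id: str, query: str, top_k: int = 5) -> list[str]:
        """Return the top_k memories for this user most similar to the query."""
        # Newer qdrant-client releases favor client.query_points over client.search.
        results = client.search(
            collection_name="agent_memories",
            query_vector=embed(query),
            query_filter=Filter(must=[
                FieldCondition(key="user_id", match=MatchValue(value=user_id))
            ]),
            limit=top_k,
        )
        return [r.payload["text"] for r in results]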
Key-value storage keeps explicit facts in structured form. User preferences, confirmed decisions, and extracted entities go into a dictionary keyed by category. Retrieval is deterministic but requires knowing which keys to look up. Redis works well here.
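The key-value approach might look something like the sketch below. To keep it runnable without a server, it uses a plain in-memory dictionary; the `KeyValueMemory` class and its method names are hypothetical, and in a Redis deployment one hash per user and category could plausibly play the same role.

```python
from collections import defaultdict
from typing import Optional


class KeyValueMemory:
    """Explicit facts grouped by user and category, looked up by exact key."""

    def __init__(self) -> None:
        # In-memory stand-in for an external store such as Redis.
        self._store: dict[tuple[str, str], dict[str, str]] = defaultdict(dict)

    def set_fact(self, user_id: str, category: str, key: str, value: str) -> None:
        self._store[(user_id, category)][key] = value

    def get_fact(self, user_id: str, category: str, key: str) -> Optional[str]:
        # .get() avoids creating empty entries for unknown users.
        return self._store.get((user_id, category), {}).get(key)

    def get_category(self, user_id: str, category: str) -> dict[str, str]:
        return dict(self._store.get((user_id, category), {}))

    def render_for_prompt(self, user_id: str, categories: list[str]) -> str:
        """Format the stored facts as lines that can be injected into a prompt."""
        lines = []
        for category in categories:
            for key, value in sorted(self.get_category(user_id, category).items()):
                lines.append(f"{category}.{key}: {value}")
        return "\n".join(lines)
```

The lookup is deterministic, which is the appeal, but the caller has to know in advance that a category such as `preferences` exists. That is the trade-off against similarity search.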
Long-term memory is appropriate for: personal assistants that need to remember user preferences, support agents that need customer history, any agent that needs continuity across sessions.
Episodic Memory: What Happened, When, and With What Outcome
Episodic memory is memory of past events, not just facts. It captures the trajectory of past interactions: what the user asked, what the agent did, what tools it used, and whether the outcome was successful.
This is the memory type that lets an agent learn. If a particular tool consistently fails for a certain query pattern, episodic memory lets the agent recognize that pattern and try a different approach. If a user rejected a certain type of response last week, episodic memory surfaces that fact before the agent makes the same mistake.
Episodic memory is harder to implement than the other two types because it requires structured logging of agent behavior, not just text. A minimal episodic memory record contains:
- Timestamp
- Task description
- Steps taken (tool calls, reasoning steps)
- Outcome (success, failure, user feedback)
- Key learnings (extracted by a summarization step)
The key learnings field is important. Raw trajectories are expensive to store and slow to retrieve. A post-episode summarization step condenses "the agent tried three different search queries before finding the right document" into "semantic search works better than keyword search for this user's document corpus."
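One way to represent such a record is sketched below. The field names mirror the list above; the extra `task_type` field, the `EpisodeLog` class, and the `failure_rate` helper are assumptions added for illustration, and the in-memory list stands in for whatever structured store the agent actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Episode:
    task_type: str         # hypothetical label used to match new tasks to past ones
    task_description: str
    steps: list[str]       # tool calls and reasoning steps, in order
    outcome: str           # "success" or "failure"
    key_learnings: str     # produced by a post-episode summarization step
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class EpisodeLog:
    def __init__(self) -> None:
        self._episodes: list[Episode] = []

    def record(self, episode: Episode) -> None:
        self._episodes.append(episode)

    def learnings_for(self, task_type: str, limit: int = 3) -> list[str]:
        """Return the most recent key learnings for a task type, newest first."""
        matches = [e for e in self._episodes if e.task_type == task_type]
        matches.sort(key=lambda e: e.timestamp, reverse=True)
        return [e.key_learnings for e in matches[:limit]]

    def failure_rate(self, task_type: str, step: str) -> float:
        """Fraction of episodes of this type that used `step` and ended in failure."""
        relevant = [
            e for e in self._episodes
            if e.task_type == task_type and step in e.steps
        ]
        if not relevant:
            return 0.0
        return sum(e.outcome == "failure" for e in relevant) / len(relevant)
```

Before starting a new task, the agent can inject `learnings_for(task_type)` into its prompt and deprioritize any tool whose `failure_rate` for that task type is high.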
The Memory-Context Trade-Off
More memory loaded into context means better continuity but higher token cost. This trade-off has to be managed deliberately.
The naive approach is to load all available memory at the start of every session. For a new user this is fine. For a user with 200 past sessions, you are now spending thousands of tokens on memory before the user has said anything.
The better approach is lazy loading: load only the memories that are relevant to the current query. Embed the user's first message, retrieve the top 5 most similar memories, and inject only those. As the conversation develops and new topics arise, retrieve additional memories on demand.
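The retrieval step of lazy loading can be sketched without any external services. Note the loud assumption here: `toy_embed` is a bag-of-words counter standing in for a real embedding model, and the `min_score` cutoff is an arbitrary illustrative value. Only the shape of the logic (score, filter, take the top k) is the point.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_relevant(
    memories: list[str], query: str, top_k: int = 5, min_score: float = 0.1
) -> list[str]:
    """Return up to top_k memories whose similarity to the query clears min_score."""
    query_vec = toy_embed(query)
    scored = [(cosine(toy_embed(m), query_vec), m) for m in memories]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in scored[:top_k]]
```

Calling `retrieve_relevant` again when the conversation shifts topic gives the on-demand behavior: unrelated memories never enter the context, so a user with hundreds of past sessions costs no more tokens than a new one.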
A further refinement is memory tiering: keep a short summary of the user's most important facts always loaded (preferences, confirmed decisions, long-term goals), and pull detailed episodic memories only when the task matches a past experience.
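A tiered context builder could be as small as the following sketch. The function name, the section headings in the output, and the idea of keying episodic details by a `task_type` string are all assumptions for illustration.

```python
from typing import Optional


def build_memory_context(
    core_summary: str,
    episodic: dict[str, list[str]],
    task_type: Optional[str],
    max_details: int = 3,
) -> str:
    """Always include the core summary; add episodic details only on a task match."""
    sections = ["Known about this user:\n" + core_summary]
    details = episodic.get(task_type, [])[:max_details] if task_type else []
    if details:
        bullets = "\n".join(f"- {d}" for d in details)
        sections.append("Relevant past experience:\n" + bullets)
    return "\n\n".join(sections)
```

The always-loaded tier stays a fixed, small cost per session, while the episodic tier costs tokens only when the current task resembles something the agent has done before.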
Practical Implementation Recommendations
For most production agents, a three-layer stack works:
- Short-term: the full conversation history in context, managed with a token budget.
- Long-term: user facts and preferences in a vector store, loaded at session start via similarity retrieval.
- Episodic: task outcomes logged to a structured store, retrieved when a new task matches a past task type.
Start with just short-term memory. Add long-term memory when users report the agent forgetting important preferences. Add episodic memory when the agent repeats failure patterns it should have learned from.
Memory is infrastructure. Build it incrementally rather than designing the full system upfront.