Memory Systems
Short-term, long-term, and episodic memory architectures that give agents persistent knowledge.
Why Memory Matters for Agents
Without memory, an AI agent is stateless — every interaction starts from scratch. Memory is what allows an agent to build on previous work, learn from past mistakes, maintain context across long tasks, and provide personalized experiences. It is the difference between an agent that solves a ten-step problem and one that forgets what it did two steps ago.
The challenge is that LLMs have a fixed context window (ranging from 32K to over 1 million tokens in current models, with some like Gemini reaching 2 million). This context window is the only "memory" the model natively has — once information falls out of the window, it is gone. Memory systems are the architectural patterns we use to overcome this fundamental limitation. Even with large context windows, memory systems remain essential for cost efficiency, relevance filtering, and persistence across separate sessions.
Effective memory systems must balance three concerns:
- Relevance — Only inject memory that is relevant to the current task. Irrelevant information wastes tokens and can confuse the model.
- Recency — More recent information is typically more important. Memory systems need strategies for prioritizing recent context.
- Importance — Not all memories are equally valuable. Following Park et al.'s (2023) "Generative Agents" framework, scoring memories by importance (mundane vs. critical) alongside recency and relevance produces significantly better retrieval.
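The three concerns above can be folded into a single retrieval score, in the spirit of the Generative Agents framework. A minimal sketch — the weights, the 0-1 scales, and the exponential recency decay are illustrative assumptions, not the paper's exact formulation:

```python
def memory_score(relevance, importance, age_seconds,
                 w_rel=1.0, w_imp=1.0, w_rec=1.0,
                 decay_per_hour=0.99):
    """Combine relevance, importance, and recency into one score.

    relevance:   similarity of the memory to the current query (0-1)
    importance:  how critical the memory is, mundane -> critical (0-1)
    age_seconds: how long ago the memory was created
    """
    recency = decay_per_hour ** (age_seconds / 3600)  # exponential decay
    return w_rel * relevance + w_imp * importance + w_rec * recency

# A fresh, relevant, important memory outranks an old, mundane one
fresh = memory_score(relevance=0.9, importance=0.8, age_seconds=60)
stale = memory_score(relevance=0.4, importance=0.2,
                     age_seconds=7 * 24 * 3600)
```

At retrieval time, score all candidate memories and inject only the top few; tuning the three weights lets you trade off freshness against long-term importance.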
Short-Term Memory (Context Window)
Short-term memory is the information held in the LLM's context window during a single session. This is the most immediate and reliable form of memory because the model has direct access to it during generation. It includes the system prompt, the current conversation history, tool results, and any injected context.
Managing short-term memory effectively means managing the context window. Key strategies include:
- Sliding window — Keep only the most recent N messages. Simple but loses important early context. Works well for casual chat but poorly for complex tasks.
- Summarization — Periodically summarize older messages into a compressed form. "The user asked about their order #1234. I looked it up and found it shipped on Monday. The user then asked about return policy." This preserves key information in fewer tokens.
- Selective pruning — Remove tool call details and intermediate reasoning once they have been resolved, keeping only the conclusions. A 500-token tool result can often be summarized to a 50-token conclusion.
# Sketch of summarization-based short-term memory; assumes the LLM
# client exposes a summarize() method returning one summary message.
class ShortTermMemory:
    def __init__(self, llm, max_tokens=100000):
        self.messages = []
        self.max_tokens = max_tokens
        self.llm = llm  # LLM used for summarization

    def token_count(self):
        # Rough estimate: ~1.3 tokens per whitespace-delimited word
        return sum(len(m.get("content", "").split()) * 1.3
                   for m in self.messages)

    def add(self, message):
        self.messages.append(message)
        # Guard on message count so a single oversized summary
        # cannot trigger an endless re-summarization loop
        while self.token_count() > self.max_tokens and len(self.messages) > 5:
            # Summarize the oldest messages instead of dropping them
            oldest = self.messages[:5]
            summary = self.llm.summarize(oldest)
            self.messages = [summary] + self.messages[5:]

    def get_context(self):
        return self.messages
Long-Term Memory (Vector Stores)
Long-term memory persists beyond a single session and is typically implemented using vector stores (also called vector databases). The core idea: convert information into numerical embeddings, store them in a vector database, and retrieve the most semantically similar entries when needed.
The retrieval process works like this:
- When new information is worth remembering, embed it as a vector and store it with metadata (timestamp, source, category).
- When the agent needs to recall information, embed the current query as a vector.
- Perform a similarity search (cosine similarity, dot product) to find the most relevant stored memories.
- Inject the retrieved memories into the context window alongside the current conversation.
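The four steps above can be sketched with a toy in-memory store. Real systems use a trained embedding model and a dedicated vector database; here `toy_embed` is a stand-in over a tiny fixed vocabulary, and similarity is plain cosine:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed    # embedding function (stand-in here)
        self.entries = []     # (vector, text, metadata) triples

    def add(self, text, metadata=None):
        self.entries.append((self.embed(text), text, metadata or {}))

    def search(self, query, top_k=3):
        qv = self.embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: cosine(qv, e[0]), reverse=True)
        return [(text, meta) for _, text, meta in ranked[:top_k]]

def toy_embed(text):
    # Crude bag-of-words vector over a tiny vocabulary (illustration only)
    vocab = ["order", "shipped", "return", "policy", "python"]
    words = text.lower().split()
    return [float(words.count(w)) for w in vocab]

store = ToyVectorStore(toy_embed)
store.add("The order shipped on Monday", {"category": "orders"})
store.add("Our return policy allows 30 days", {"category": "policy"})
hits = store.search("when did my order ship", top_k=1)
```

The structure is the same with a production store: only `toy_embed` (replaced by an embedding model) and the linear scan (replaced by an approximate nearest-neighbor index) change.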
Popular vector stores include Pinecone, Weaviate, ChromaDB, Qdrant, and pgvector (PostgreSQL extension). Each offers different trade-offs in terms of scale, speed, filtering capabilities, and hosting options.
Long-term memory enables powerful capabilities:
- User preferences — "This user prefers concise responses and always wants code in Python."
- Past interactions — "Last week, we discussed their migration from AWS to GCP."
- Accumulated knowledge — Information the agent has gathered across many sessions can be recalled when relevant.
Episodic and Semantic Memory
Drawing from cognitive science, advanced agent architectures distinguish between episodic memory and semantic memory:
Episodic Memory
Episodic memory stores specific experiences — complete interaction sequences with their context and outcomes. Think of it as the agent's autobiography. "On Tuesday, the user asked me to debug a React component. I found the issue was a stale closure in a useEffect hook. The fix was adding the dependency array. The user confirmed it worked."
This is valuable because it allows the agent to:
- Learn from past successes and failures.
- Recognize similar situations and apply proven strategies.
- Provide continuity across sessions ("Last time we worked on this, we...").
Semantic Memory
Semantic memory stores facts and relationships independent of when they were learned. This is often implemented as a knowledge graph — a network of entities and their relationships. "User X works at Company Y. Company Y uses Python and AWS. Their main product is a SaaS platform."
Knowledge graphs are particularly powerful for agents that need to reason about complex domains with many interconnected entities. They complement vector stores by providing structured, relationship-aware retrieval rather than purely similarity-based search.
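A semantic memory of this kind can start as a plain triple store. A minimal sketch — the entity and relation names below are illustrative, not a prescribed schema:

```python
class TripleStore:
    """Toy knowledge graph: (subject, relation, object) triples."""

    def __init__(self):
        self.triples = set()

    def add(self, subj, rel, obj):
        self.triples.add((subj, rel, obj))

    def related(self, entity):
        """All triples mentioning the entity, in either position."""
        return [t for t in self.triples if entity in (t[0], t[2])]

kg = TripleStore()
kg.add("User X", "works_at", "Company Y")
kg.add("Company Y", "uses", "Python")
kg.add("Company Y", "uses", "AWS")

facts = kg.related("Company Y")  # structured, relationship-aware lookup
```

Even this trivial version shows the contrast with a vector store: `related("Company Y")` returns exact, typed relationships rather than approximately similar text.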
Procedural Memory
A third type from cognitive science that maps onto agent architectures is procedural memory — knowledge of how to do things. In agents, this corresponds to the model's fine-tuned weights, system prompts, and learned behavioral patterns. While episodic memory stores what happened and semantic memory stores what is known, procedural memory encodes how the agent should behave — its skills, routines, and interaction patterns.
# Combining episodic and semantic memory
class AgentMemory:
    def __init__(self):
        self.episodic = VectorStore("episodes")  # Past experiences
        self.semantic = KnowledgeGraph()         # Facts & relations
        self.working = []                        # Current context

    def recall(self, query):
        episodes = self.episodic.search(query, top_k=3)
        facts = self.semantic.query_related(query)
        return {
            "relevant_episodes": episodes,
            "known_facts": facts,
            "current_context": self.working,
        }
Working Memory and Scratchpads
Working memory is a structured intermediate space where the agent organizes its current thinking. Unlike short-term memory (which is the raw context window), working memory is deliberately structured and curated by the agent itself. Note that in agent engineering, this distinction between “working memory” and “short-term memory” is a practical convention — in cognitive psychology, these terms overlap significantly.
Common working memory patterns include:
Scratchpad — A dedicated section of the prompt where the agent writes intermediate results, partial calculations, and notes to itself. This is especially useful for complex reasoning tasks where the agent needs to track multiple threads of thought.
Task state — A structured object that tracks what the agent has accomplished so far, what remains, and any important intermediate results. This is critical for long-running tasks that span many agent loop iterations.
Reflection buffer — After completing a subtask, the agent reflects on what it learned and writes a summary to its working memory. This helps prevent repeated mistakes and enables more efficient reasoning in subsequent steps.
# Working memory as structured state
working_memory = {
    "goal": "Migrate the user database from MySQL to PostgreSQL",
    "completed_steps": [
        "Analyzed MySQL schema - 12 tables, 3 with foreign keys",
        "Generated PostgreSQL DDL for all tables",
    ],
    "current_step": "Writing data migration script",
    "blockers": ["Need to handle MySQL ENUM types -> PostgreSQL CHECK constraints"],
    "notes": ["User prefers using psycopg2 over SQLAlchemy for this task"],
}
The most effective agent architectures combine all memory types: short-term context for the current turn, working memory for the current task, episodic memory for past experiences, and semantic memory for accumulated knowledge. Each layer serves a different purpose and operates at a different time scale.
Implementation Strategies
When implementing memory for your agent, consider these practical strategies:
Start simple, add complexity as needed. Begin with sliding-window short-term memory and a basic vector store for long-term memory. Only add episodic memory, knowledge graphs, and sophisticated retrieval when you have evidence that simpler approaches are insufficient.
Choose the right embedding model. The quality of your vector store retrieval depends heavily on the embedding model. Models like OpenAI's text-embedding-3-large, Cohere's embed-english-v3.0 (or the multimodal embed-v4.0), or open-source models like BGE-M3 each have different strengths. Test retrieval quality on your specific data before committing.
Design your chunking strategy carefully. How you split information into chunks for storage dramatically affects retrieval quality. Key considerations:
- Chunk size: 200-500 tokens is a good starting range. Too small loses context; too large wastes tokens and dilutes relevance.
- Overlap: Use 10-20% overlap between chunks to avoid splitting important context across chunk boundaries.
- Semantic boundaries: Split on paragraph or section boundaries rather than arbitrary token counts when possible.
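The size and overlap guidance above can be sketched as a word-based chunker. Production systems count model tokens with the model's actual tokenizer; whitespace-delimited words are a rough proxy here:

```python
def chunk_text(text, chunk_size=300, overlap_ratio=0.15):
    """Split text into overlapping word-based chunks.

    chunk_size is in words (a rough proxy for tokens);
    overlap_ratio keeps 10-20% shared context between neighbors
    so important passages are not severed at a boundary.
    """
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(text, chunk_size=300, overlap_ratio=0.15)
```

A semantic-boundary version replaces the fixed `step` with splits at paragraph or section breaks, then only falls back to the window when a section exceeds `chunk_size`.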
Use metadata filtering. Tag stored memories with metadata (timestamp, category, user ID, session ID) and use metadata filters during retrieval. This is often more effective than relying solely on semantic similarity. "Give me memories from this user from the last 7 days" is a metadata filter; "Give me memories about database migration" is a semantic search. Combining both yields the best results.
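Combining the two can be sketched as a metadata pre-filter followed by similarity ranking. The memory record shape and the field names below are illustrative assumptions, not a specific library's schema:

```python
import time

def hybrid_search(memories, query_vec, similarity,
                  user_id=None, max_age_days=None, top_k=3):
    """Metadata pre-filter first, then rank survivors semantically.

    Each memory is a dict with "vector", "text", "user_id",
    and "timestamp" (Unix seconds).
    """
    now = time.time()
    candidates = [
        m for m in memories
        if (user_id is None or m["user_id"] == user_id)
        and (max_age_days is None
             or now - m["timestamp"] <= max_age_days * 86400)
    ]
    candidates.sort(key=lambda m: similarity(query_vec, m["vector"]),
                    reverse=True)
    return candidates[:top_k]

dot = lambda a, b: sum(x * y for x, y in zip(a, b))
now = time.time()
memories = [
    {"vector": [1.0, 0.0], "text": "recent note for user a",
     "user_id": "a", "timestamp": now - 86400},       # 1 day old
    {"vector": [1.0, 0.0], "text": "old note for user a",
     "user_id": "a", "timestamp": now - 30 * 86400},  # 30 days old
    {"vector": [0.0, 1.0], "text": "recent note for user b",
     "user_id": "b", "timestamp": now - 86400},
]
result = hybrid_search(memories, [1.0, 0.0], dot,
                       user_id="a", max_age_days=7)
```

Most vector databases push this pre-filter down into the index itself, which is far cheaper than filtering in application code at scale.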
Key Takeaways
- Memory systems overcome the LLM's fixed context window limitation, enabling agents to persist knowledge across steps and sessions.
- Short-term memory (context window) is managed through sliding windows, summarization, and selective pruning.
- Long-term memory uses vector stores to embed, store, and retrieve information by semantic similarity.
- Episodic memory stores specific experiences; semantic memory stores facts and relationships; procedural memory encodes behavioral skills and routines.
- Working memory (scratchpads, task state) provides structured intermediate storage for complex reasoning tasks.
- Start with simple memory patterns and add complexity only when simpler approaches prove insufficient.