# Memory

Manage conversation history and context with agents

## Concept
Memory is a core component of agentic systems. It allows you to store and retrieve information from the past.

In LlamaIndexTS, you can create memory by using the `createMemory` function. This function returns a `Memory` object, which you can then use to store and retrieve information. As the agent runs, it will make calls to `add()` to store information, and `get()` to retrieve information.
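For example, here's a minimal sketch of that loop, using only calls shown elsewhere on this page (`createMemory` with no options, `add()`, and `get()`):

```ts
import { createMemory } from "llamaindex";

const memory = createMemory();

// Store a message in memory
await memory.add({ role: "user", content: "Hello!" });

// Retrieve everything stored so far
const messages = await memory.get();
console.log(messages);
```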
## Usage
A `Memory` object has both short-term memory (i.e. a FIFO queue of messages) and optionally long-term memory (i.e. information extracted over time).

`get()` always returns all messages stored in the memory; the longer the agent runs, the more likely these exceed the agent's context window. To avoid this, the agent uses the `getLLM` method to get only the most recent messages that fit into the context window.
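A quick sketch of the difference between the two calls (assuming an `llm` instance like the one created in the next section, and that `getLLM` accepts it so the message window can be sized to that model):

```ts
// get() returns the full history, however long it has grown
const allMessages = await memory.get();

// getLLM() returns only the most recent messages that fit the token budget
// (assumption: getLLM takes the LLM whose context window should be respected)
const fittingMessages = await memory.getLLM(llm);
```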
### Configuring Memory for an Agent
Here we're creating a memory with a static block (read more about memory blocks under Long-Term Memory below) that contains some information about the user:
```ts
import { openai } from "@llamaindex/openai";
import { agent } from "@llamaindex/workflow";
import { createMemory, staticBlock } from "llamaindex";

const llm = openai({ model: "gpt-4.1-mini" });

// Create memory with predefined context
const memory = createMemory({
  memoryBlocks: [
    staticBlock({
      content:
        "The user is a software engineer who loves TypeScript and LlamaIndex.",
    }),
  ],
});

// Create an agent with the memory
const workflow = agent({
  name: "assistant",
  llm,
  memory,
});

const result = await workflow.run("What is my name?");
console.log("Response:", result.data.result);
```
### Using Vercel format
You can also add messages in the Vercel format directly to the memory:
```ts
await memory.add({
  id: "1",
  createdAt: new Date(),
  role: "user",
  content: "Hello!",
  options: {
    parts: [
      {
        type: "file",
        data: "base64...",
        mimeType: "image/png",
      },
    ],
  },
});
```
If you call `get()`, messages are returned in the LlamaIndexTS format (type `ChatMessage`) by default. If you specify the `type` parameter when calling `get()`, you can return the messages in different formats. E.g. using `type: "vercel"`, you can return the messages in Vercel format:
```ts
const messages = await memory.get({ type: "vercel" });
console.log(messages);
```
## Customizing Memory

### Short-Term Memory
The `Memory` object stores all the messages that are added to it. Unless you call `clear()`, no messages are removed from the memory. This is the short-term memory (usually you will store the messages of one user session there), which is augmented by the long-term memory.

Calling `getLLM` retrieves messages from short-term and long-term memory and ensures that the given `tokenLimit` is not exceeded. These are the messages that you will send to the LLM.
For initialization, you call `createMemory` with the following options:

- `tokenLimit`: Maximum tokens for memory retrieval using `getLLM` (default: 30000).
- `shortTermTokenLimitRatio`: Ratio of tokens for short-term vs. long-term memory (default: 0.7).
- `customAdapters`: Custom message adapters for different message formats. LlamaIndex (`ChatMessageAdapter`) and Vercel (`VercelMessageAdapter`) are built-in adapters.
- `memoryBlocks`: Memory blocks for long-term storage, see Long-Term Memory below.
Example:

```ts
const memory = createMemory({
  tokenLimit: 40000,
  shortTermTokenLimitRatio: 0.5,
});
```
### Long-Term Memory
Long-term memory is represented as memory block objects. These blocks contain information from previous user sessions or from the beginning of the current conversation. When memory is retrieved (by calling `getLLM`), the short-term and long-term memories are merged together within the given `tokenLimit`.
Currently, there are three predefined memory blocks:

- `staticBlock`: A memory block that stores a static piece of information.
- `factExtractionBlock`: A memory block that extracts facts from the chat history.
- `vectorBlock`: A memory block that stores and retrieves chat messages from a vector database using semantic similarity search. Messages are stored individually and retrieved based on their relevance to recent conversation context.
This sounds a bit complicated, but it's actually quite simple. Let's look at an example:
```ts
import { OpenAIEmbedding, openai } from "@llamaindex/openai";
import { QdrantVectorStore } from "@llamaindex/qdrant";
import {
  Settings,
  createMemory,
  factExtractionBlock,
  staticBlock,
  vectorBlock,
} from "llamaindex";

// The vector block needs an embedding model to store and retrieve messages
Settings.embedModel = new OpenAIEmbedding();

const llm = openai({ model: "gpt-4.1-mini" });

const memoryBlocks = [
  staticBlock({
    content: "My name is Logan, and I live in Saskatoon. I work at LlamaIndex.",
  }),
  factExtractionBlock({
    priority: 1,
    llm: llm,
    maxFacts: 50,
  }),
  vectorBlock({
    vectorStore: new QdrantVectorStore({ url: "http://localhost:6333" }),
    priority: 2,
  }),
];
```
Here, we've set up three memory blocks:

- `staticBlock`: A static memory block that stores some core information about the user. This information will always be inserted into the memory. The content type is `MessageContent`, to support multi-modal content.
- `factExtractionBlock`: An extracted memory block that will extract facts from the chat history. Here we've passed in the `llm` to use to extract facts from the chat history, and set `maxFacts` to 50. If the number of extracted facts exceeds this limit, the facts will be automatically summarized and reduced to leave room for new information.
- `vectorBlock`: A vector memory block that will store chat messages in a vector database and retrieve them from there. Messages are stored individually and retrieved based on their relevance to recent conversation context. Here we've passed in the `vectorStore` to use to store and retrieve the chat messages.
You'll also notice that we've set the `priority` for the `factExtractionBlock` block. The priority determines the handling when the memory blocks' content (i.e. long-term memory) plus short-term memory exceeds the token limit on the `Memory` object:

- `priority=0`: This block will always be kept in memory (`staticBlock`s always have priority 0).
- `priority=1, 2, 3, etc.`: This determines the order in which memory blocks are truncated when the memory exceeds the token limit, to help keep the combined short-term and long-term memory content less than or equal to the `tokenLimit`.
Now, let's pass these blocks into the `createMemory` function:

```ts
const memory = createMemory({
  tokenLimit: 40000,
  memoryBlocks: memoryBlocks,
});
```
When memory is retrieved (using `getLLM`), the short-term and long-term memories are merged together. The `Memory` object will ensure that the combined content is less than or equal to the `tokenLimit`. If it would be longer, messages are included in the following order of precedence:
- StaticMemoryBlock (information always included)
- LongTermMemoryBlock (depending on priority)
- ShortTermMemoryBlock
- Transient messages
The amount of short-term memory included is specified by the `shortTermTokenLimitRatio`. If it's set to `0.7`, 70% of the `tokenLimit` is used for short-term memory (not including the static memory block).
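As a rough illustration of how that budget splits (simple arithmetic, not the library's exact accounting):

```ts
const tokenLimit = 40000;
const shortTermTokenLimitRatio = 0.7;

// ~28000 tokens are reserved for recent (short-term) messages
const shortTermBudget = tokenLimit * shortTermTokenLimitRatio;

// ~12000 tokens remain for long-term memory blocks
// (the static block sits outside the short-term share, as noted above)
const longTermBudget = tokenLimit - shortTermBudget;
```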
#### VectorBlock Configuration Options
The `vectorBlock` offers several configuration options to customize its behavior:
```ts
import { QdrantVectorStore } from "@llamaindex/qdrant";
import { PromptTemplate, VectorStoreQueryMode, vectorBlock } from "llamaindex";

vectorBlock({
  vectorStore: new QdrantVectorStore({ url: "http://localhost:6333" }),
  priority: 2,
  retrievalContextWindow: 5, // Number of recent messages to use for context when retrieving
  formatTemplate: new PromptTemplate({ template: "Context: {{ context }}" }), // Custom formatting template
  nodePostprocessors: [
    /* custom postprocessors */
  ], // Apply processing to retrieved nodes
  queryOptions: {
    similarityTopK: 3, // Number of top similar results to return (default: 2)
    mode: VectorStoreQueryMode.DEFAULT, // Query mode for the vector store
    sessionFilterKey: "session_id", // Metadata key for session filtering (default: "session_id")
    // Custom filters can be added here - the session filter is automatically included
    filters: {
      filters: [{ key: "custom_field", value: "custom_value", operator: "==" }],
      condition: "and",
    },
  },
});
```
**Key configuration options:**

- `retrievalContextWindow`: Number of recent messages to consider when creating the retrieval query (default: 5). A larger window provides more context but may be less precise.
- `formatTemplate`: Template for formatting retrieved information before adding it to memory. Defaults to a simple context template.
- `nodePostprocessors`: Array of postprocessors to apply to retrieved nodes, useful for filtering or transforming results.
- `queryOptions.similarityTopK`: Number of most similar messages to retrieve from the vector store (default: 2).
- `queryOptions.sessionFilterKey`: Metadata key used to isolate memory between different sessions (default: `"session_id"`).
- `queryOptions.filters`: Additional metadata filters for retrieval. The session filter is automatically added to ensure memory isolation.
**Session isolation:** The `vectorBlock` automatically adds a session filter using the block's ID to ensure that memories from different sessions don't interfere with each other. This filter uses the `sessionFilterKey` (default: `"session_id"`) and can be customized if needed.
## Persistence with Snapshots
Save and restore memory state:
```ts
import { createMemory, loadMemory } from "llamaindex";

const memory = createMemory();

// Add some messages
await memory.add({ role: "user", content: "Hello!" });

// Create a snapshot of the current memory state
const snapshot = memory.snapshot();

// Later, restore from the snapshot
const restoredMemory = loadMemory(snapshot);
```
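A snapshot can be persisted wherever you like. A minimal sketch that writes it to disk, assuming the snapshot object survives a JSON round-trip:

```ts
import { readFileSync, writeFileSync } from "node:fs";
import { loadMemory } from "llamaindex";

// Persist the snapshot to a file (assumption: it is JSON-serializable)
writeFileSync("memory-snapshot.json", JSON.stringify(snapshot));

// In a later process, read it back and restore the memory
const saved = JSON.parse(readFileSync("memory-snapshot.json", "utf-8"));
const restored = loadMemory(saved);
```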
## Examples

Want to learn more about the Memory class? Check out our example code on GitHub.