Your AI companion can hold a brilliant conversation for ten minutes and then forget everything. That is the default experience for most deployed bots in 2026, and it is the single biggest reason users abandon them. Memory is what separates a novelty chatbot from a companion people actually rely on. The good news: implementing memory is no longer a research problem. It is an engineering decision with clear tradeoffs in cost, latency, and complexity.
This guide walks through three practical memory architectures you can add to any conversational AI companion today. Whether you are running a Telegram bot, a Slack coworker, or a Discord community assistant, the patterns are the same. We will cover buffer memory, conversation summarization, and vector-based long-term recall, with real cost numbers and code examples. If you are new to deploying AI assistants, start with our Telegram deployment tutorial first, then come back here to add memory.
Why Memory Matters More Than Model Choice
Developers obsess over which LLM to use (see our Claude vs GPT comparison) but underinvest in memory infrastructure. The reality is that a mid-tier model with great memory will outperform a frontier model with none. Users expect companions to remember their name, preferences, past decisions, and ongoing projects. Without memory, every conversation starts from zero, and the user has to re-explain context every single time.
Memory also directly impacts your token costs. Naive implementations stuff the entire conversation history into every API call: a 50-message thread with a 1,000-token system prompt can easily hit 15,000 input tokens per request. Intelligent memory systems like Mem0 report a 26% improvement in response quality while reducing token usage by over 90%. That is not a marginal optimization. It is the difference between a $300/month API bill and a $30/month one.
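The arithmetic behind that claim is easy to sketch. The prices and request volume below are illustrative assumptions, not provider quotes:

```javascript
// Back-of-the-envelope token cost model (assumed numbers, not provider quotes)
const PRICE_PER_MILLION_INPUT = 2.5; // USD per million input tokens, assumed
const requestsPerMonth = 8000;       // roughly 260 requests per day

// Naive: the full 15,000-token history goes out with every call
const naiveTokens = 15000 * requestsPerMonth;
const naiveMonthlyCost = (naiveTokens / 1e6) * PRICE_PER_MILLION_INPUT;

// With ~90% token reduction, you send about one-tenth as many tokens
const managedTokens = naiveTokens / 10;
const managedMonthlyCost = (managedTokens / 1e6) * PRICE_PER_MILLION_INPUT;

console.log(naiveMonthlyCost, managedMonthlyCost); // 300 30
```

Scale the request volume to your own traffic; the ratio between the two bills is what matters, and it tracks the token reduction almost exactly.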
The Three Layers of AI Memory
Production memory systems work best as a hierarchy. Each layer serves a different time horizon, and the most robust companions use all three.
Layer 1: Buffer Memory (Immediate Context)
Buffer memory is the simplest form. You keep the last N messages in the conversation and pass them to the LLM on every request. This gives the model awareness of the current session without any additional infrastructure.
```javascript
// Simple buffer memory: keep the last 10 messages
const MAX_BUFFER = 10;

async function handleMessage(userId, newMessage) {
  const history = await getHistory(userId);
  const buffer = history.slice(-MAX_BUFFER);
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      ...buffer,
      { role: "user", content: newMessage }
    ]
  });
  // Store the assistant's text, not the raw response object
  const reply = response.choices[0].message.content;
  await appendHistory(userId, newMessage, reply);
  return reply;
}
```

Buffer memory is fast, free, and requires zero additional infrastructure. The downside is obvious: once the buffer window slides past a message, it is gone. The model has no idea what happened 20 messages ago. For casual chatbots this is fine. For a companion that needs to remember your project deadlines or dietary preferences, it is insufficient.
Layer 2: Summarization Memory (Session Compression)
Summarization memory solves the buffer limitation by periodically compressing older messages into a condensed summary. Instead of passing 50 raw messages, you pass a 200-token summary of messages 1 through 40 plus the raw last 10 messages. This preserves context while keeping token costs predictable.
```javascript
// Summarization memory with rolling compression
async function buildContext(userId) {
  const summary = await getSummary(userId);
  const recentMessages = await getRecentMessages(userId, 10);
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: `Previous conversation summary: ${summary}` },
    ...recentMessages
  ];
}

// Run after every 20 new messages
async function compressSummary(userId) {
  const oldSummary = await getSummary(userId);
  const newMessages = await getMessagesSinceSummary(userId);
  const updated = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Compress the following conversation into a concise summary. Preserve key facts, user preferences, and action items." },
      { role: "user", content: `Existing summary: ${oldSummary}\n\nNew messages: ${JSON.stringify(newMessages)}` }
    ]
  });
  await saveSummary(userId, updated.choices[0].message.content);
}
```

The common pattern is to summarize everything older than 20 messages while keeping the last 10 verbatim. This balances context preservation with token management. The summarization call itself uses a cheap, fast model (GPT-4o-mini or Claude Haiku) so the overhead is minimal. For most bots handling under 1,000 daily users, the summarization cost is under $1/month.
Layer 3: Vector Memory (Long-Term Recall)
Vector memory is the breakthrough that makes AI companions feel genuinely personal. Instead of compressing everything into a single summary, you embed individual facts, preferences, and conversation snippets into a vector database. When the user sends a new message, you query the vector store for semantically relevant memories and inject them into the prompt.
This is the core of RAG (Retrieval-Augmented Generation) applied to personal memory. The user says "What did we decide about the Q3 budget?" and the vector search retrieves the exact conversation fragment where that decision was made, even if it happened weeks ago.
```javascript
// Vector memory with Cloudflare Vectorize and Workers AI embeddings.
// Uses the native env.AI binding (the @cloudflare/ai npm package is deprecated).

async function storeMemory(env, userId, text) {
  const embedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: [text]
  });
  await env.VECTORIZE_INDEX.upsert([{
    id: crypto.randomUUID(),
    values: embedding.data[0],
    metadata: { userId, text, timestamp: Date.now() }
  }]);
}

async function recallMemories(env, userId, query, topK = 5) {
  const queryEmbedding = await env.AI.run("@cf/baai/bge-small-en-v1.5", {
    text: [query]
  });
  // Filtering on userId requires a metadata index on that field
  const results = await env.VECTORIZE_INDEX.query(queryEmbedding.data[0], {
    topK,
    filter: { userId },
    returnMetadata: "all"
  });
  return results.matches.map(m => m.metadata.text);
}
```

The key design decision is what to store. Do not embed raw message text. Instead, extract and embed discrete facts: "User prefers Python over TypeScript," "User's project deadline is March 15," "User is allergic to peanuts." Smaller, fact-oriented embeddings retrieve more accurately than long conversation chunks.
Cost Comparison: Memory Infrastructure in 2026
Memory infrastructure costs vary dramatically depending on scale and provider. Here is a realistic comparison for a companion handling 1,000 daily active users with approximately 10 stored memories per user (10,000 total vectors at 384 dimensions).
| Provider | Free Tier | Paid Starting At | Best For |
|---|---|---|---|
| Cloudflare Vectorize | 5M stored dims, 30M queried dims/mo | $0.01/M queried dims | Edge-native apps on Workers |
| Mem0 | 10K memories, 1K retrievals/mo | $19/mo (50K memories) | Managed memory with zero config |
| Pinecone | Starter plan with limited storage | $70/mo (Standard plan) | Enterprise-scale RAG pipelines |
| Upstash Vector | 10K vectors, 500K queries/mo | $0.40/100K queries | Serverless with pay-per-query |
For companions deployed on Cloudflare Workers (which is what getclaw uses under the hood), Cloudflare Vectorize is the natural choice. At 10,000 vectors with 384 dimensions, you are storing 3.84 million vector dimensions. That fits entirely within the free tier. Even at 100,000 vectors, your monthly cost stays under $1. The tight integration with Workers AI for embedding generation means zero additional network hops.
Mem0 is worth considering if you want a turnkey solution. Their API handles embedding, storage, retrieval, and even automatic fact extraction from conversations. The free tier (10,000 memories, 1,000 retrievals per month) covers most indie projects. Their benchmarks show a 26% improvement in response quality over naive buffer approaches, and their SOC 2 compliance matters if you are handling user data in regulated industries.
Architecture: Putting All Three Layers Together
The strongest memory architecture combines all three layers in a single request pipeline:
- Buffer: Include the last 8 to 10 raw messages for immediate conversational flow.
- Summary: Prepend a compressed summary of the current session for mid-range context.
- Vector recall: Query the vector store with the user's latest message and inject the top 3 to 5 relevant memories.
```javascript
// Combined three-layer memory pipeline
async function buildFullContext(env, userId, newMessage) {
  // Layer 1: Buffer (last 10 messages)
  const recentMessages = await getRecentMessages(userId, 10);
  // Layer 2: Session summary
  const sessionSummary = await getSummary(userId);
  // Layer 3: Vector recall
  const memories = await recallMemories(env, userId, newMessage, 5);
  const memoryBlock = memories.length > 0
    ? `Relevant memories about this user:\n${memories.join("\n")}`
    : "";
  return [
    { role: "system", content: systemPrompt },
    { role: "system", content: `Session context: ${sessionSummary}` },
    ...(memoryBlock ? [{ role: "system", content: memoryBlock }] : []),
    ...recentMessages,
    { role: "user", content: newMessage }
  ];
}
```

This pipeline adds roughly 50 to 100ms of latency from the vector query. On Cloudflare's network, Vectorize queries typically resolve in under 30ms because the index lives on the same edge infrastructure as your Worker. Compare that to a cross-region Pinecone query that can add 100 to 200ms. For a Telegram bot where users expect sub-second responses, that difference matters.
Memory Extraction: What to Remember
The most common mistake developers make is storing everything. Not every message deserves a place in long-term memory. Effective memory systems extract and store only high-signal facts.
- User preferences: "Prefers concise answers," "Works in Python," "Located in Berlin."
- Decisions and commitments: "Decided to use PostgreSQL for the project," "Budget approved at $5,000."
- Personal facts: "Has two kids," "Runs a bakery," "Vegetarian."
- Temporal events: "Meeting with Sarah scheduled for March 10," "Product launch on April 1."
You can automate fact extraction with a cheap model. After every conversation turn, pass the exchange to GPT-4o-mini or Claude Haiku with instructions to extract any new facts worth remembering. This typically costs less than $0.001 per extraction call. Only store facts that are novel; skip anything already captured in existing memories.
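The novelty check described above can be sketched as a small parsing step. The model call itself is left as a placeholder comment, and the prompt wording, helper names, and exact-match dedupe are illustrative assumptions; production systems typically dedupe with embedding similarity rather than string matching:

```javascript
// Parse a cheap model's fact-extraction output and drop already-known facts.
// Assumes the model was told to return one fact per line, or "NONE".
function parseNewFacts(rawOutput, existingFacts) {
  if (rawOutput.trim() === "NONE") return [];
  const known = new Set(existingFacts.map(f => f.toLowerCase().trim()));
  return rawOutput
    .split("\n")
    .map(f => f.trim())
    .filter(f => f.length > 0 && !known.has(f.toLowerCase()));
}

// Hypothetical glue, run after each conversation turn:
// const raw = await callCheapModel(
//   "Extract durable facts about the user, one per line, or NONE.",
//   exchangeText
// );
// const newFacts = parseNewFacts(raw, await getStoredFacts(userId));
// for (const fact of newFacts) await storeMemory(env, userId, fact);
```

Keeping the parsing step pure makes it easy to unit-test without hitting the model API at all.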
Privacy and Data Retention
Memory creates a privacy surface. You are now storing personal information about users beyond the ephemeral conversation. Design your memory system with these principles:
- User control: Let users view, edit, and delete their stored memories. Anthropic's Claude recently made memory management free for all users, and your companion should offer the same.
- Automatic expiry: Set TTLs on memories that are time-sensitive. A "meeting tomorrow" memory is useless after that date.
- Namespace isolation: Never let one user's memories leak into another user's context. Filter by user ID on every vector query.
- BYOK compliance: With a BYOK architecture, API keys stay under the user's own account. If you add a vector store, keep it in the same account so you never touch their data directly.
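The expiry and isolation principles above can be enforced in one post-query filter. This is a minimal sketch: the `expiresAt` metadata field and the shape of `matches` are assumptions modeled loosely on Vectorize query results, not a prescribed schema:

```javascript
// Enforce namespace isolation and TTL at recall time.
// `matches` mimics vector-query results with metadata { userId, text, expiresAt }.
function filterMemories(matches, userId, now = Date.now()) {
  return matches
    .filter(m => m.metadata.userId === userId)                         // no cross-user leaks
    .filter(m => !m.metadata.expiresAt || m.metadata.expiresAt > now)  // drop expired TTLs
    .map(m => m.metadata.text);
}
```

Filtering by user ID inside the vector query itself (as in the recall example earlier) is the first line of defense; a belt-and-suspenders filter like this one catches misconfigured indexes before a leaked memory ever reaches the prompt.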
Benchmarking Your Memory System
Before shipping memory to production, measure three metrics:
- Recall accuracy: Ask the companion questions about previously discussed topics. Does it retrieve the right memories? Aim for 80%+ accuracy on a test set of 50 questions.
- Latency overhead: Measure the p95 latency with and without memory retrieval. The delta should be under 100ms for edge deployments.
- Token efficiency: Compare total input tokens per request before and after implementing summarization. You should see a 40 to 60% reduction for active conversations.
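The latency metric is straightforward to compute from recorded timings. The samples below are hypothetical; the percentile helper uses the nearest-rank method, which is one of several common conventions:

```javascript
// Nearest-rank percentile over recorded request timings (milliseconds)
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[idx];
}

// Hypothetical timings with and without memory retrieval
const withMemory = [120, 95, 110, 140, 105, 98, 130, 115, 102, 125];
const withoutMemory = [70, 65, 80, 75, 68, 72, 78, 66, 74, 69];
const p95Overhead = percentile(withMemory, 95) - percentile(withoutMemory, 95);
// Compare p95Overhead against the 100ms budget for edge deployments
```

Collect a few hundred samples per configuration before trusting the p95; with only a handful of requests, the tail percentiles are dominated by noise.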
Getting Started with getclaw
If you want to skip the infrastructure work, getclaw deploys your AI companion to Cloudflare Workers with a BYOK model, so you bring your own API key and only pay provider rates with zero markup. The bot runtime, webhook handling, and multi-channel routing (Telegram, Slack, Discord) are managed for you. You can then layer memory on top using the patterns in this guide: buffer memory requires no extra infrastructure, summarization runs inside your model calls, and vector recall can be added with Cloudflare Vectorize or any external vector database alongside your Worker.
Read our getting started guide to deploy your first companion in under two minutes, or check the API reference for details on configuration. If you are evaluating platforms, our comparison with Voiceflow and Botpress covers how getclaw differs from no-code builders.
Conclusion
Memory is the feature that transforms a stateless chatbot into a trusted companion. The three-layer approach (buffer, summarization, and vector recall) is production-proven and affordable at any scale. Cloudflare Vectorize makes the infrastructure nearly free for edge-native bots. Managed solutions like Mem0 let you add memory with a single API call. And the cost savings from intelligent memory management often pay for the entire memory infrastructure many times over through reduced token usage.
Start with buffer memory today, add summarization when conversations get long, and graduate to vector recall when users start expecting your companion to remember things from weeks ago. The best time to add memory was when you launched. The second best time is now.