Agent Memory: Dreaming, Feedback, and Continual Learning Path

Matthew Berman · watch the original →

Expert breakdown of Anthropic's Dreaming feature for memory consolidation, OpenAI's human-feedback UI, and strategies to make agents self-improve while cutting inference costs, drawing on years-ahead papers like MemGPT and Stanford's Simulacra.

Memory Consolidation via Dreaming

Anthropic's new Dreaming feature for Claude's managed agents reviews past sessions to identify patterns, reinforce useful memories, and discard noise, mimicking human sleep for better recall and efficiency. It offloads that compute to off-peak hours, enabling sharper responses with fewer tokens during peak usage, which is crucial for operational cost savings at scale. Richmond De explains that it consolidates memory signals, resolves conflicts, and surfaces user patterns faster in future interactions, echoing the 'sleep-time compute' concept from MemGPT researchers Sarah Wooders and Charles Packer, who were a year ahead with their 2023 paper on offline memory processing.
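
Anthropic hasn't published Dreaming's internals, but the described behavior (reinforce what's useful, forget noise, run off-peak) maps onto a simple decay-and-boost pass over stored memories. A minimal sketch, assuming a per-memory strength score; every name here is invented for illustration:

```python
import time
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    strength: float = 1.0                      # retrieval weight
    uses: int = 0                              # hits since the last pass
    last_used: float = field(default_factory=time.time)

def dream(memories: list[Memory], decay: float = 0.9,
          boost: float = 1.2, prune_below: float = 0.2) -> list[Memory]:
    """One offline consolidation pass: decay everything slightly,
    reinforce what was actually used, prune what falls below a floor."""
    kept: list[Memory] = []
    for m in memories:
        m.strength *= decay                    # default forgetting
        if m.uses > 0:                         # the memory earned its keep
            m.strength = min(m.strength * boost * m.uses, 10.0)
        m.uses = 0                             # reset for the next window
        if m.strength >= prune_below:
            kept.append(m)
    return kept
```

Run nightly, a pass like this keeps the hot path cheap: peak-hour requests only carry the surviving, high-strength memories.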

OpenAI counters with a UI upgrade that lets users upvote or downvote memory sources, refining personalization through human feedback on atomic 'memory units.' The feedback reinforces good information and prunes bad, improving answer relevance over time. De notes it's a short-term bridge: frontier labs like OpenAI use it to gather data for model retraining. He also warns of pitfalls like last year's personality-rollout rollback, which he reads as a memory failure.
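
Mechanically, a vote like this is just a signed weight on each atomic memory unit that biases future retrieval. A hedged sketch of the pattern; the field names are assumptions, not OpenAI's schema:

```python
from dataclasses import dataclass

@dataclass
class MemoryUnit:
    text: str
    score: float = 0.0                  # cumulative human feedback

    def upvote(self, step: float = 1.0) -> None:
        self.score += step              # reinforce good info

    def downvote(self, step: float = 1.0) -> None:
        self.score -= step              # candidate for pruning

def recall(units: list[MemoryUnit], floor: float = -2.0) -> list[MemoryUnit]:
    """Surface best-rated units first; drop anything the user has vetoed."""
    return sorted((u for u in units if u.score > floor),
                  key=lambda u: u.score, reverse=True)
```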

Foundations in Agent Memory Research

Pioneering work like the Stanford Simulacra paper (2023) simulated 1,000 agents with personalities and memory in a town, yielding emergent behaviors: agents formed relationships and planned birthday parties with invite chains, prefiguring multi-agent systems. De calls it 'multi-agent before multi-agent,' foundational for today's forgetting and reinforcement mechanisms. MemGPT and Hindsight (from Vectorize) further advanced consolidation and hindsight review, with De collaborating on Oracle integrations.
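
The Simulacra agents retrieved memories with a weighted sum of recency, importance, and relevance, the root of those reinforcement-and-forgetting mechanisms. A compact sketch of that scoring; the weights and half-life are illustrative, and the embedding function is assumed rather than specified:

```python
import math

def _cosine(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieval_score(mem: dict, query_vec: list[float], now: float,
                    w_recency: float = 1.0, w_importance: float = 1.0,
                    w_relevance: float = 1.0, half_life_s: float = 3600.0) -> float:
    """Generative-agents-style score: recency decays exponentially since
    last access, importance is rated once at write time (0..1), and
    relevance is embedding similarity to the current query."""
    recency = math.exp(-(now - mem["last_access"]) / half_life_s)
    relevance = _cosine(mem["embedding"], query_vec)
    return (w_recency * recency
            + w_importance * mem["importance"]
            + w_relevance * relevance)
```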

These aren't novel hacks; they're rooted in three-year-old ideas like Simulacra's human simulator for feedback loops. De emphasizes that agent memory levels the playing field: solo devs can compete with labs by implementing simple reinforcement, because memory is intuitive enough that even non-experts grasp it.

Human Feedback vs. Autonomous Self-Improvement

Today's memory is bolted on (vector stores, logs), an implicit admission that transformers lack native state, but De argues it's sufficient in the short term: Dario Amodei claims in-context learning alone can unlock trillions in value without weight updates. Human-in-the-loop feedback persists for now because of reliability demands at OpenAI's scale (hundreds of millions of users per week), but the ideal is autonomy: agents grade themselves from usage patterns, avoiding slow human bottlenecks.
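
What self-grading could look like in practice: after each turn, every memory that was surfaced gets credited or debited by whether the turn succeeded, with no human vote in the loop. A sketch under assumed signals; detecting 'success' reliably is the hard, unsolved part:

```python
def self_grade(retrieved_ids: list[str], success: bool,
               scores: dict[str, float], step: float = 0.5) -> None:
    """Adjust scores for every memory used this turn. 'success' might
    come from a passing tool call, an accepted answer, or the absence
    of a user correction; these are heuristics, not a lab's method."""
    delta = step if success else -step
    for mem_id in retrieved_ids:
        scores[mem_id] = scores.get(mem_id, 0.0) + delta

scores: dict[str, float] = {}
self_grade(["mem-42", "mem-7"], success=True, scores=scores)
print(scores)  # {'mem-42': 0.5, 'mem-7': 0.5}
```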

Critiques that this is just 'bolting [memory] on over logs' hold, but continual learning, where model weights update continuously in real time, looms. Oracle experiments daily and released a Python package last week for adding memory to common agents, blending inference savings with emerging weight tweaks.
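
The talk doesn't detail that package's API, so here is the generic shape of bolt-on memory rather than the library itself: a vector store sitting beside a stateless model, written on every turn and read before every prompt.

```python
import math

def _cos(a: list[float], b: list[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

class BoltOnMemory:
    """Generic bolt-on pattern: not the package mentioned above,
    just the shape such libraries wrap."""

    def __init__(self, embed):
        self.embed = embed                     # any text -> vector function
        self.items: list[tuple[list[float], str]] = []

    def write(self, text: str) -> None:
        self.items.append((self.embed(text), text))

    def read(self, query: str, k: int = 3) -> list[str]:
        qv = self.embed(query)
        ranked = sorted(self.items, key=lambda it: _cos(it[0], qv), reverse=True)
        return [text for _, text in ranked[:k]]
```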

Operational Costs and Economic Imperative

Agent memory slashes inference costs by prioritizing relevant recall over full-context dumps, which is vital as labs hit compute walls. Anthropic's compute constraint until recently drove Dreaming's push for token efficiency; OpenAI's UI personalizes at scale. De, three years into the space, sees memory as the obvious fix across the AI stack, enabling trillion-scale value via smarter, leaner agents.
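
The cost argument is simple arithmetic: at roughly four characters per token, a full-context dump grows linearly with history while top-k recall stays flat. An illustrative comparison; the numbers are made up, not lab figures:

```python
def approx_tokens(texts: list[str]) -> int:
    return sum(len(t) for t in texts) // 4     # rough 4-chars-per-token rule

notes = [f"session note {i}: " + "detail " * 60 for i in range(500)]
top_k = notes[:5]                              # stand-in for the relevant hits

print(approx_tokens(notes))  # full dump: tens of thousands of tokens per request
print(approx_tokens(top_k))  # relevant recall: a few hundred tokens
```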

Notable Quotes

  • "Dreaming... consolidating memory fixing reinforcing some memory signals... forgetting some of the previous information that might not be as useful." —Richmond De on Anthropic's feature.
  • "Directionally very bad." —Mir Murati's text to Sam Altman, highlighting relatable boardroom drama.
  • "This was multi book before mult book." —De on Simulacra paper's emergent agent behaviors.
  • "Memory is not solved." —De on the field's ongoing challenges.
  • "With the in context learning capabilities of this LLM we can get to trillion or multiple trillion um uh value." —Dario Amodei via podcast.

Key Takeaways

  • Implement memory consolidation offline (like Dreaming) to cut peak-hour tokens by 20-50% via pattern reinforcement.
  • Use upvote/downvote on memory units for quick personalization; automate via agent self-evaluation long-term.
  • Read the Simulacra (2023) and MemGPT papers for emergent-behavior and sleep-time-compute basics.
  • Bolt-on memory (vectors/logs) works now; experiment with Oracle's new Python package for agents.
  • Prioritize continual learning for weight updates to fuse memory and compute, but in-context learning suffices for massive value today.
  • Test human feedback sparingly—scale to autonomous via pattern detection to avoid OpenAI-style rollbacks.
  • Track Hindsight/Vectorize for hindsight review; join De's free DeepLearning.AI course on memory.
  • Offload memory processing to non-peak for cost savings, mimicking human sleep.
  • #news
  • #review

summary by x-ai/grok-4.1-fast. probably wrong about something. check the source.