Intrinsic Memory Agents: Heterogeneous Multi-Agent LLM Systems through Structured Contextual Memory
Sizhe Yuen, Francisco Gomez Medina, Ting Su, Yali Du, Adam J. Sobey

TL;DR
This paper introduces Intrinsic Memory Agents, a new framework for multi-agent LLM systems that uses agent-specific, evolving memories to overcome context limitations, improving performance and consistency across various tasks.
Contribution
The paper presents a novel intrinsic memory framework that maintains role-aligned, evolving memories without handcrafted prompts, enhancing multi-agent LLM system capabilities.
Findings
Achieves state-of-the-art performance on PDDL, FEVER, and ALFWorld datasets.
Demonstrates higher quality designs in complex data pipeline tasks.
Shows improved scalability, reliability, and cost-effectiveness.
Abstract
Multi-agent systems built on Large Language Models (LLMs) show exceptional promise for complex collaborative problem-solving, yet they face fundamental challenges stemming from context window limitations that impair memory consistency, role adherence, and procedural integrity. This paper introduces Intrinsic Memory Agents, a novel framework that addresses these limitations through agent-specific memories that evolve intrinsically with agent outputs. Specifically, our method maintains role-aligned memory that preserves specialized perspectives while focusing on task-relevant information. Our approach utilises a generic memory template applicable to new problems without the need to hand-craft specific memory prompts. We benchmark our approach on the PDDL, FEVER, and ALFWorld datasets, comparing its performance to existing state-of-the-art multi-agentic memory approaches and showing…
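As a rough illustration of the idea in the abstract, the following minimal Python sketch shows one way agent-specific memory could evolve from an agent's own outputs. It is not the authors' implementation: the template slots, the `Agent` class, and `append_updater` are all hypothetical names, and the updater is a deterministic stand-in for the LLM call that would rewrite the memory in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical generic template: slot names are illustrative, not from the paper.
GENERIC_TEMPLATE: Dict[str, str] = {"role": "", "task_state": "", "decisions": ""}

# An updater maps (current memory, latest own output) -> new memory.
Updater = Callable[[Dict[str, str], str], Dict[str, str]]


@dataclass
class Agent:
    name: str
    role: str
    memory: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every agent starts from the same template, specialised only by role.
        self.memory = dict(GENERIC_TEMPLATE, role=self.role)

    def build_prompt(self, task: str, recent_turns: List[str]) -> str:
        # The agent's own memory stands in for the full conversation history.
        memory_text = "\n".join(f"{k}: {v}" for k, v in self.memory.items())
        recent_text = "\n".join(recent_turns)
        return f"Task: {task}\nMemory:\n{memory_text}\nRecent turns:\n{recent_text}"

    def absorb_output(self, output: str, update: Updater) -> None:
        # Memory evolves only from this agent's own output.
        self.memory = update(self.memory, output)


def append_updater(memory: Dict[str, str], output: str) -> Dict[str, str]:
    """Assumed stand-in for an LLM call that rewrites the memory slots."""
    updated = dict(memory)
    previous = memory["decisions"]
    updated["decisions"] = f"{previous} | {output}" if previous else output
    return updated


planner = Agent("planner", "Planner")
critic = Agent("critic", "Critic")
planner.absorb_output("use batch ingestion", append_updater)
planner.absorb_output("add schema check", append_updater)
prompt = planner.build_prompt("design a data pipeline", ["critic: what about latency?"])
```

Here only the planner's memory changes after the planner speaks, while the critic's memory stays at its initial template, which is the sense in which the memories remain agent-specific in this sketch.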
Peer Reviews
Decision · Submitted to ICLR 2026
- **Novel and Relevant Methodology:** The authors do a good job of identifying a real-world limitation of multi-agent systems: how homogeneous memory causes agents to lose their specialized perspectives. Their proposed "Intrinsic Memory" with "Structured Templates" is a novel and well-thought-out solution to this problem. This approach seems to directly target the core challenges of role adherence and perspective inconsistency.
- **Rigorous Case Study Evaluation:** The data pipeline design case
- **Manual Template Engineering and Limited Generality:** The method's biggest weakness is its reliance on "manually created," task-specific templates. The authors acknowledge this in Section 6. This reliance on expert human effort means the framework can't be easily generalized to new tasks or different agent roles, which severely limits its practical, out-of-the-box utility.
- **Methodological Flaw in PDDL Benchmark:** I am not convinced by the conclusions from the PDDL benchmark (Section 4).
1. Clear motivation and problem framing: The paper clearly articulates the problem it aims to solve and provides a strong rationale for its approach.
2. Clear and organized presentation: The paper is well-organized and uses language that is easy to understand, making the core concepts accessible.
3. Provides detailed experimental details for reproducibility: The inclusion of specific details, such as the prompts used (Appendix B), is commendable and supports the reproducibility of the work.
1. Limited experimental evaluation: The evaluation is confined to the PDDL benchmark and a single case study. In contrast, other significant works in this area (e.g., G-Memory) are often evaluated on a more diverse set of benchmarks spanning multiple domains.
2. Limited statistical robustness of the PDDL benchmark: The PDDL benchmark results are based on a single run with a set seed (Sec 4). This severely limits the statistical significance of the results presented in Table 1 and makes it diffi
The proposed output-driven memory evolution offers a specialized solution to the context window problem in MAS. This mechanism helps maintain role adherence by constantly refreshing agent-specific context, which benefits the performance of MAS on two downstream tasks.
1. The core mechanism of IMA, refining a structured memory template based on self-output, primarily operates at the prompt engineering level. The design lacks a distinct technical contribution, relying entirely on an LLM's internal reasoning step. The authors must rigorously justify why this intrinsic, prompt-based refinement constitutes a novel advance beyond existing context management techniques.
2. The experimental validation is limited to a specific set of collaborative tasks, which restric
Taxonomy
Topics: AI-based Problem Solving and Planning · Multi-Agent Systems and Negotiation · Multimodal Machine Learning Applications
