TL;DR
This paper introduces an adaptive, data-driven memory framework for LLM-based agents that models memory cycles to enhance memorization and performance in specific environments, addressing limitations of manual memory mechanisms.
Contribution
It proposes a novel memory optimization framework with an MoE gate, learnable aggregation, and task-specific reflection, enabling more effective memorization in LLM agents.
Findings
Improved memory utilization in LLM agents.
Enhanced performance in environment-specific tasks.
Effective modeling of memory cycles improves agent adaptability.
Abstract
LLM-based agents have been extensively applied across various domains, where memory stands out as one of their most essential capabilities. Previous memory mechanisms of LLM-based agents are manually predefined by human experts, leading to higher labor costs and suboptimal performance. In addition, these methods overlook the memory cycle effect in interactive scenarios, which is critical to optimizing LLM-based agents for specific environments. To address these challenges, in this paper, we propose to optimize LLM-based agents with an adaptive and data-driven memory framework by modeling memory cycles. Specifically, we design an MoE gate function to facilitate memory retrieval, propose a learnable aggregation process to improve memory utilization, and develop task-specific reflection to adapt memory storage. Our memory framework empowers LLM-based agents to learn how to memorize…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper is relatively easy to read. 2. It addresses an important problem - memory mechanism in LLM-based agents. 3. A large number of diverse baselines are used in the experiments.
1. The Related Works section, especially the part on reinforcement learning, is too general. The purpose of such sections is to position the presented work relative to the most relevant existing studies, highlighting its novelty and significance. General information should instead be placed in the Background section. 2. The presentation of results in Table 1, split into two blocks for each dataset, is not convenient for readability. It would be better to display all results in a single block. 3.
(1) A learned MoE gate weights relevance/recency/importance/emotion per query instead of fixed cosine-only ranking. (2) Storage uses learned reflection to write task-specific memories, improving future retrieval precision. (3) An on-policy phase realigns retrieval/usage/storage with the agent’s own trajectories, reducing distribution shift and stabilizing the full loop.
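The per-query weighting described in point (1) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the gate is a single linear layer followed by a softmax, and the names `gate_weights`, `rank_memories`, `gate_matrix`, and the toy feature vectors are all hypothetical. Only the four metric names (relevance, recency, importance, emotion) come from the review text.

```python
import math

# Metric names taken from the review; everything else here is an illustrative assumption.
METRICS = ("relevance", "recency", "importance", "emotion")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(query_features, gate_matrix):
    """Hypothetical linear gate: one logit per metric, normalized with softmax."""
    logits = [sum(w * f for w, f in zip(row, query_features)) for row in gate_matrix]
    return softmax(logits)


def rank_memories(query_features, memories, gate_matrix, top_k=2):
    """Score each memory as a gate-weighted mix of its per-metric scores."""
    weights = gate_weights(query_features, gate_matrix)
    scored = []
    for mem in memories:
        score = sum(w * mem["scores"][name] for w, name in zip(weights, METRICS))
        scored.append((score, mem["text"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


# Toy data: memory A is highly relevant, memory B is recent and important.
memories = [
    {"text": "A", "scores": {"relevance": 0.9, "recency": 0.1, "importance": 0.1, "emotion": 0.1}},
    {"text": "B", "scores": {"relevance": 0.2, "recency": 0.9, "importance": 0.9, "emotion": 0.2}},
]
uniform_gate = [[0.0, 0.0] for _ in METRICS]        # all metrics weighted equally
relevance_gate = [[10.0, 0.0]] + [[0.0, 0.0]] * 3   # gate that favors relevance
query = [1.0, 0.0]

print(rank_memories(query, memories, uniform_gate))
print(rank_memories(query, memories, relevance_gate))
```

With the uniform gate the ranking is driven by the average of the four scores, while the relevance-heavy gate pushes memory A to the top. In a learned version, `gate_matrix` would be trained rather than set by hand, which is the contrast the reviewer draws with fixed cosine-only ranking.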
1. In terms of novelty, the work feels more like a solid systems integration than a conceptual leap. 2. The main tables lack significance tests, making variance hard to judge, especially on easier splits. 3. Results are strongest on HotpotQA. A compact test on a different agent task (e.g., web-acting, tool use) would help demonstrate generality. 4. Off-policy data construction and the supervision sources for SFT/DPO could be described more precisely to assess bias. 5. This paper suggests naive combinations can degrade
**1. Complex Method Design with Concrete Problem Formulation.** > The paper introduces all main components in the pipeline and presents mathematical formulations for them.
**1. Lack of Novelty in the Proposed Method.** > Unlike what’s stated in the paper, there have been many works that consider the “memory cycle effect” by inducing, verifying, retrieving, and using memory entries [1,2]. Beyond non-parametric update approaches like the one proposed in this work [3], some works also explore parametric updates [4]. There are many other memory-related works besides those referenced in this comment. That being said, it is unclear what is the unique method this
1. The motivation of learning to memorize is a promising direction for LLM-based decision making. 2. The proposed memory cycle effect is a good concept that the memory storage, retrieval, and utilization procedures influence each other. 3. The authors test their framework with different tasks and LLMs. The ablation study also demonstrates the effect of each part.
1. The whole framework is quite complex, while the benefits of learnable memory are not significant enough for GPT-4o-mini and Qwen-2.5. 2. Many details, such as the rationale for design choices, are missing from the text. For example, why do the authors use the Bernoulli distribution for the stop signal in memory utilization? 3. It seems that the training of this adaptive framework is costly, but no details are given.
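To make the reviewer's question about the Bernoulli stop signal concrete, here is a minimal sketch of what such a mechanism could look like. It assumes memories are folded into the context one at a time and that, after each step, a stop decision is sampled from a Bernoulli distribution. The function `aggregate_with_stop` and the `stop_prob` callable are hypothetical stand-ins, not the paper's actual aggregation model, which the excerpt does not specify.

```python
import random


def aggregate_with_stop(ranked_memories, stop_prob, rng):
    """Add memories to the context one by one, sampling a Bernoulli stop signal after each.

    stop_prob is a hypothetical callable mapping the current context to a
    probability in [0, 1]; in a learned system it would be produced by a model.
    """
    context = []
    for memory in ranked_memories:
        context.append(memory)
        # Bernoulli(p) draw: stop with probability p, continue otherwise.
        if rng.random() < stop_prob(context):
            break
    return context


rng = random.Random(0)  # seeded so the sketch is reproducible
memories = ["m1", "m2", "m3"]

# Probability 1.0 always stops after the first memory; 0.0 never stops early.
print(aggregate_with_stop(memories, lambda ctx: 1.0, rng))
print(aggregate_with_stop(memories, lambda ctx: 0.0, rng))
```

One plausible reason for a stochastic stop, offered here only as an assumption, is that a sampled binary decision gives a log-probability that policy-gradient style training can optimize, whereas a hard threshold would not.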
