MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang

TL;DR
MEM1 introduces a reinforcement learning framework enabling long-horizon language agents to operate with constant memory, improving efficiency and reasoning performance across various multi-turn tasks by consolidating relevant information and discarding redundancies.
Contribution
The paper presents MEM1, a novel RL-based method for memory management in long-horizon agents, allowing scalable, constant-memory operation and improved reasoning in complex multi-turn environments.
Findings
MEM1-7B improves performance by 3.5x over Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task.
MEM1 reduces memory usage by 3.7x relative to the same baseline on that task.
The approach generalizes beyond training horizons.
Abstract
Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional…
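The per-turn update described in the abstract can be illustrated with a loop that carries forward only a bounded state. The sketch below is a toy under stated assumptions, not the paper's implementation: `truncate_consolidate` is a hypothetical stand-in that keeps the most recent characters, whereas MEM1 learns its consolidation with reinforcement learning; sizes are counted in characters rather than tokens; and all function names are invented for this example.

```python
from typing import Callable, List


def full_context_sizes(observations: List[str]) -> List[int]:
    """Context size per turn when every past observation is appended."""
    context = ""
    sizes = []
    for obs in observations:
        context += obs
        sizes.append(len(context))
    return sizes


def truncate_consolidate(state: str, obs: str, budget: int = 32) -> str:
    """Hypothetical consolidation: keep only the last `budget` characters.

    Assumption: this stands in for the learned state update; it is NOT
    how MEM1 decides what to keep.
    """
    return (state + obs)[-budget:]


def constant_state_sizes(
    observations: List[str],
    consolidate: Callable[[str, str], str] = truncate_consolidate,
) -> List[int]:
    """Context size per turn when only (state, newest observation) is kept."""
    state = ""
    sizes = []
    for obs in observations:
        # The agent sees the previous state plus the newest observation.
        sizes.append(len(state) + len(obs))
        # Older turns are dropped; only the consolidated state survives.
        state = consolidate(state, obs)
    return sizes


if __name__ == "__main__":
    turns = ["x" * 20] * 5
    print("full context:  ", full_context_sizes(turns))
    print("constant state:", constant_state_sizes(turns))
```

With five 20-character observations, the full-context sizes grow linearly with the number of turns, while the constant-state sizes plateau at the state budget plus one observation.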
Peer Reviews
Decision: ICLR 2026 Poster
- The problem is well motivated, targeting a critical bottleneck in using LLM agents for real-world problems. Curbing unbounded context growth has many applications for AI agents, including deep research, web agents, and game-playing agents.
- The proposed method is simple but effective. MEM1 encourages the AI agent to synergise reasoning and memory consolidation by designing a clever rollout mechanism in an RL pipeline. The simplicity and end-to-end nature also implies the po
- MEM1’s reliance on verifiable and dense reward signals limits its applicability to open-ended or subjective tasks. Many realistic LLM-agent settings (e.g., creative reasoning and open-ended QA) lack such clear supervision. It would be interesting to see how it can be extended to cases where the rewards are more implicit.
- Some presentation issues: (1) The naming is a bit messy. The paper has used “long-turn”, “long-horizon”, and “multi-turn” throughout. Do they mean the same thing? If so, the authors shoul
- **Clear, Well-Structured Presentation**: The technical motivation, algorithm, and evaluation methodology are clearly described. Figures such as Figure 1 (“RL pipeline”) and Figure 2 (“performance and efficiency scaling”) directly help crystallize the approach and the empirical insights.
- **Sound, End-to-End RL Optimization**: The use of reinforcement learning to train both reasoning and memory management jointly is well argued and empirically shown to benefit generalization to longer, more c
- **Evaluation Reflects Synthetic Compositions, Not Open-Ended Dialogue.** Benchmarks are mainly constructed by composing QA subsets, which exercises multi-turn reasoning but biases toward compositional templates. Even WebShop, though interactive, is governed by predefined tasks, constrained action spaces, and scripted reward assumptions, so the current setup under-represents genuinely open-ended interactions with ambiguous goals or shifting task boundaries.
- **Insufficient Direct Comparison
1. The paper proposes to treat memory as part of the policy and learn it jointly with reasoning, instead of relying on an external memory module or retrieval system. This is a refreshing take on long-horizon agents and feels conceptually meaningful: the model actually learns what to remember rather than being manually engineered to store past information.
2. Existing LLM agents don’t scale well over long contexts because their memory grows linearly (or worse) with interaction steps. MEM1 tackle
1. The EM reward design lacks an ablation study and may limit real-world applicability. Although the paper adopts EM as the sole reward signal during RL training for QA tasks, it includes no ablation study to analyze or justify this specific reward choice. For example, the authors do not compare EM with other potential reward signals such as token-level F1, partial matching, or step-wise retrieval rewards that might better capture intermediate reasoning quality. Also, the assumption of verifiable reward
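The contrast the last review draws between an exact-match (EM) reward and token-level F1 can be made concrete with a small sketch. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles); the paper's own normalization and reward code are not shown in this excerpt, so the function names and details here are illustrative only.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and articles, split on whitespace.

    Assumption: SQuAD-style normalization, not necessarily the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """Binary reward: 1.0 only if the normalized answers are identical."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Graded reward: harmonic mean of token precision and recall."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = "Eiffel Tower"
    for pred in ["The Eiffel Tower", "Eiffel Tower in Paris", "Louvre"]:
        print(pred, exact_match(pred, gold), round(token_f1(pred, gold), 3))
```

A partially correct answer such as "Eiffel Tower in Paris" scores zero under EM but receives partial credit under token-level F1, which is the kind of difference an ablation over reward signals would probe.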
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
