MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Hongli Yu; Tinghong Chen; Jiangtao Feng; Jiangjie Chen; Weinan Dai; Qiying Yu; Ya-Qin Zhang; Wei-Ying Ma; Jingjing Liu; Mingxuan Wang; Hao Zhou

arXiv:2507.02259·cs.CL·July 4, 2025

TL;DR

MemAgent introduces a novel memory-based approach for long-text processing in large language models, enabling efficient extrapolation to extremely long contexts with minimal performance loss.

Contribution

The paper presents MemAgent, a new agent framework that reads and updates memory segments, extending training algorithms for better long-context extrapolation in LLMs.
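
As a rough illustration of the training extension: the reviews on this page describe the final answer's reward being propagated back to the intermediate memory-update conversations. The sketch below shows one way to picture that, assuming a simple mean-only group baseline. The function name `broadcast_advantages` and the baseline choice are hypothetical, since this page does not state the paper's exact normalization.

```python
from statistics import mean
from typing import List


def broadcast_advantages(
    final_rewards: List[float],
    convs_per_sample: List[int],
) -> List[List[float]]:
    """Share one outcome-based advantage across all conversations of a sample.

    final_rewards[i]    : reward of sample i's final answer (one per rollout)
    convs_per_sample[i] : number of independent-context conversations
                          (memory updates plus the final answer) in sample i

    Assumption: the advantage is the reward minus the group mean. The actual
    algorithm may normalize differently.
    """
    baseline = mean(final_rewards)
    # Every conversation in a sample receives that sample's advantage.
    return [[r - baseline] * n for r, n in zip(final_rewards, convs_per_sample)]


# Two rollouts for the same question: one correct (3 conversations),
# one incorrect (2 conversations).
advantages = broadcast_advantages([1.0, 0.0], [3, 2])
```

Each inner list holds one value per conversation, so intermediate memory updates are credited or penalized together with the answer they led to.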

Findings

1. Extrapolates from an 8K context window (trained on 32K-token text) to a 3.5M-token QA task with less than 5% performance loss.

2. Achieves over 95% accuracy on the 512K RULER test.

3. Demonstrates superior long-context capabilities in experiments.

Abstract

Despite improvements from length extrapolation, efficient attention, and memory modules, handling infinitely long documents with linear complexity and without performance degradation during extrapolation remains the ultimate challenge in long-text processing. We directly optimize for long-text tasks in an end-to-end fashion and introduce a novel agent workflow, MemAgent, which reads text in segments and updates the memory using an overwrite strategy. We extend the DAPO algorithm to facilitate training via independent-context multi-conversation generation. MemAgent has demonstrated superb long-context capabilities: it extrapolates from an 8K context trained on 32K text to a 3.5M QA task with performance loss < 5% and achieves 95%+ on the 512K RULER test.
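
The segment-by-segment workflow in the abstract can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: `update_memory` and `answer` stand in for LLM calls, and the toy versions below (`toy_update`, `toy_answer`) are hypothetical placeholders so the example runs without a model.

```python
from typing import Callable, List


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Cut the document into consecutive segments of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def read_with_memory(
    tokens: List[str],
    question: str,
    update_memory: Callable[[str, str, List[str]], str],
    answer: Callable[[str, str], str],
    chunk_size: int = 4,
) -> str:
    """Read the document one segment at a time, overwriting the memory each step."""
    memory = ""  # starts empty; a real system would cap its length
    for chunk in split_into_chunks(tokens, chunk_size):
        # Overwrite strategy: the new memory replaces the old one entirely.
        # Each call sees only (question, memory, chunk), never the full document.
        memory = update_memory(question, memory, chunk)
    # The final answer is produced from the question and the memory alone.
    return answer(question, memory)


def toy_update(question: str, memory: str, chunk: List[str]) -> str:
    """Hypothetical stand-in for the model: remember the last number seen."""
    numbers = [tok for tok in chunk if tok.isdigit()]
    return numbers[-1] if numbers else memory


def toy_answer(question: str, memory: str) -> str:
    """Hypothetical stand-in for the model: answer straight from memory."""
    return memory


document = "filler filler the code is 42 filler filler filler".split()
result = read_with_memory(document, "What is the code?", toy_update, toy_answer)
```

Because each step processes a bounded input (question, memory, one segment), the number of model calls grows with the number of segments, which is the sense in which the abstract describes linear complexity.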

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The memory update is formulated as a reinforcement learning problem.
2. The experiments across RULER-HQA, LongBench-QA, and NIAH show the effectiveness of the proposed method.
3. A detailed ablation study is conducted.

Weaknesses

1. The comparison is weak. There are other long-context modeling methods, such as FocusLLM (FocusLLM: Precise Understanding of Long Context by Dynamic Condensing) and E2LLM (E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning), which also use chunks. Moreover, there are memory-based methods such as Mem0. The proposed method should be compared with them.
2. Despite inference efficiency, the RL training process is computationally heavy. The time

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The agent-based memory workflow is an elegant and practical solution to the long-context problem, sidestepping the quadratic complexity of attention.
2. The experimental results are outstanding. The ability to extrapolate from an 8K training context to a 3.5M token QA task with less than 10% performance drop is great.
3. By design, the method scales linearly with the length of the input document in terms of both time and memory, making it highly efficient for real-world deployment on extremel

Weaknesses

1. The fixed-size memory is the source of the method's efficiency, but it's also a potential bottleneck. For tasks that require synthesizing many disparate pieces of information from across a long document, the model might discard critical information prematurely. The paper could discuss this trade-off more explicitly and analyze failure cases where this occurs. Moreover, I think a better choice is to use variable-sized memory based on the amount of information contained in the context. How to i

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. **Strong Performance:** Extrapolating from an 8K training context to 3.5M tokens incurs only minimal performance loss, effectively addressing a crucial problem in long-context modeling. The strong results on RULER-HQA, LongBench-QA, and NIAH demonstrate state-of-the-art performance and generalization.
2. **Novel and Effective RL Framework:** The Multi-conv DAPO algorithm is a clever solution to a difficult credit-assignment problem. By propagating the final answer's reward back to all intermedi

Weaknesses

1. Memory management seems to be a new form of lost-in-the-middle: in the ablation shown in Figure 9, increasing the memory size does not improve performance but rather decreases it, which still demonstrates the model's inherent problem with handling long context (in this case, the context window of the memory).
2. The memory length differs across tasks, which makes it an additional hyperparameter that may be costly to tune.

Taxonomy

Methods: DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)