AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Cheng Jiayang; Dongyu Ru; Lin Qiu; Yiyang Li; Xuezhi Cao; Yangqiu Song; Xunliang Cai

arXiv:2603.01966·cs.CL·March 3, 2026


Open Access · 3 Reviews

TL;DR

AMemGym is an interactive benchmarking environment designed to evaluate and improve memory management in long-horizon conversational assistants, addressing limitations of static data approaches and enabling on-policy assessment.

Contribution

We introduce AMemGym, a novel environment for on-policy evaluation and optimization of memory in conversational agents, incorporating structured data sampling and simulated user interactions.

Findings

01

Existing memory systems show significant performance gaps.

02

AMemGym effectively differentiates between memory approaches.

03

Structured evaluation guides memory system improvements.

Abstract

Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive…
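The structured sampling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the schema, the `Question` type, and every function name below are hypothetical, and the rule that exactly one attribute changes per period is an assumption made only to keep the example short. It shows how predefining user states and state-dependent questions lets gold answers be computed automatically at each point in a trajectory.

```python
import random
from dataclasses import dataclass

# Hypothetical state schema: each latent user attribute and its allowed values.
STATE_SCHEMA = {
    "diet": ["vegetarian", "vegan", "omnivore"],
    "city": ["Berlin", "Tokyo", "Austin"],
}


@dataclass
class Question:
    """A multiple-choice question whose correct choice depends on one state key."""
    text: str
    depends_on: str   # state key that determines the answer
    answers: dict     # state value -> correct choice label


def sample_trajectory(schema, n_periods, rng):
    """Sample an initial state, then change one attribute per period (assumed rule)."""
    state = {key: rng.choice(values) for key, values in schema.items()}
    trajectory = [dict(state)]
    for _ in range(n_periods - 1):
        key = rng.choice(sorted(schema))
        alternatives = [v for v in schema[key] if v != state[key]]
        state[key] = rng.choice(alternatives)
        trajectory.append(dict(state))
    return trajectory


def gold_answer(question, state):
    """Look up the correct choice for the user's state at a given period."""
    return question.answers[state[question.depends_on]]


def score(predictions, question, trajectory):
    """Fraction of periods where the assistant's choice matches the gold answer."""
    gold = [gold_answer(question, s) for s in trajectory]
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In this sketch, a simulated user would reveal each period's state through conversation, and the assistant's answers would be scored against `gold_answer` for the state in force at that period, so an assistant that keeps answering from an outdated state is penalized.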

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

* The paper is grounded in a novel motivation: existing benchmarks mostly focus on off-policy memory evaluation, which may leave a gap relative to realistic systems.
* The introduction of the write/read/utilization decomposition is a meaningful contribution.
* The paper also conducts extensive comparisons that shed light on long-context degradation and memory design trade-offs.

Weaknesses

* While the paper positions on-policy evaluation as a core contribution, the empirical and conceptual justification for its advantage over off-policy settings remains underdeveloped. Although Table 2 and Figure 5 show some rank changes between on- and off-policy settings, the paper does not provide insights on why those differences matter. It remains unclear what specific behavioral aspects of “interactive memory” are uniquely captured. In fact, one could argue that well-curated off-policy datas

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. The paper's central critique of off-policy evaluation is compelling and well-articulated. The authors provide concrete evidence in Table 2 that evaluation rankings of memory systems change when moving from an off-policy to an on-policy setup, proving that the distinction is not just theoretical but has practical consequences.
2. The introduction of write, read, and utilization failure metrics gives better insight into failure modes than the usual accuracy metric.
3. The self-evolution e

Weaknesses

1. The work only evaluates memory by having the assistant select the correct answer to multiple-choice questions; it does not test generation capabilities.
2. Memory is tested using structured key-value pairs; the work does not test episodic memory or cases where the assistant has to reason over multiple facts.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- **Clear motivation**: Off-policy evaluations can induce reuse bias; AMemGym offers an on-policy, diagnostically rich setup.
- **Methodological novelty**: Persona/state trajectories, exposure utterances, QA variants (with reflection) enable constrained interaction and automated scoring; normalized memory score and stage-wise failure analysis are useful.
- **Thorough evaluation**: Quantifies on- vs off-policy discrepancies; characterizes long-horizon degradation of native LLMs; provides gran

Weaknesses

- **External validity of simulated users**: Add a small human-in-the-loop comparison and a systematic study of user-LLM choice.
- **Broader baselines**: Include structured memory graphs/event stores, hierarchical compression, and explicit state trackers.
- **Leakage control**: Provide anti-leak prompt design and automatic leakage checks.
- **Metric reporting**: Add variance/CI and difficulty-conditioned analyses for the normalized memory score.
- **Scope of self-evolution**: Jointly evol

Taxonomy

Topics: AI in Service Interactions · Personal Information Management and User Behavior · Social Robot Interaction and HRI