MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

TL;DR
This paper introduces MINTEval, a comprehensive benchmark for evaluating memory-augmented agents' ability to handle interference and evolving information in long-horizon tasks across diverse domains.
Contribution
The paper presents MINTEval, a novel benchmark with diverse, long, and interference-heavy contexts to assess the robustness of memory systems in realistic scenarios.
Findings
All evaluated systems perform poorly, with an average accuracy of only 27.9%.
Performance drops as the number of context updates increases.
Current memory systems struggle with recall and reasoning over revised facts.
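To make the notion of interference from updates concrete, here is a minimal toy sketch in Python. It is not the paper's evaluation protocol or data: the `Update` record, the example log, and the two answer functions are all hypothetical, chosen only to show how a memory that returns a stale (first-written) value gets worse as a fact is revised, and how accuracy can be bucketed by update count.

```python
from dataclasses import dataclass


@dataclass
class Update:
    step: int   # position of the update in the (hypothetical) context stream
    key: str    # the fact being written, e.g. "alice.city"
    value: str  # the value asserted at this step


def latest_value(log, key):
    """Ground truth in this toy setup: the most recent value written for key."""
    value = None
    for u in sorted(log, key=lambda u: u.step):
        if u.key == key:
            value = u.value
    return value


def first_value(log, key):
    """A deliberately naive memory: returns the first value ever seen for key."""
    for u in sorted(log, key=lambda u: u.step):
        if u.key == key:
            return u.value
    return None


def accuracy_by_update_count(log, keys, answer_fn):
    """Group keys by how many times they were written and score answer_fn."""
    buckets = {}
    for key in keys:
        n_updates = sum(1 for u in log if u.key == key)
        correct = answer_fn(log, key) == latest_value(log, key)
        hits, total = buckets.get(n_updates, (0, 0))
        buckets[n_updates] = (hits + int(correct), total + 1)
    return {n: hits / total for n, (hits, total) in buckets.items()}


# Hypothetical interleaved update log: three facts revised 3, 1, and 2 times.
LOG = [
    Update(1, "alice.city", "Paris"),
    Update(2, "bob.city", "Rome"),
    Update(3, "alice.city", "Berlin"),
    Update(4, "carol.city", "Oslo"),
    Update(5, "alice.city", "Tokyo"),
    Update(6, "carol.city", "Lima"),
]
KEYS = ["alice.city", "bob.city", "carol.city"]

if __name__ == "__main__":
    print(accuracy_by_update_count(LOG, KEYS, first_value))
    print(accuracy_by_update_count(LOG, KEYS, latest_value))
```

In this toy log the naive memory is correct only for the fact that was never revised. This illustrates the failure mode the findings describe, but it says nothing about how the evaluated systems actually behave.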
Abstract
Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3)…
