MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

Hyunji Lee; Justin Chih-Yao Chen; Joykirat Singh; Zaid Khan; Elias Stengel-Eskin; Mohit Bansal

arXiv:2605.18565·cs.CL·May 20, 2026

TL;DR

This paper introduces MINTEval, a comprehensive benchmark for evaluating memory-augmented agents' ability to handle interference and evolving information in long-horizon tasks across diverse domains.

Contribution

MINTEval provides long, highly interconnected contexts in which information is frequently updated, spanning four domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), to test how robust memory systems are to interference.

Findings

1. All evaluated systems perform poorly with an average accuracy of 27.9%.

2. Performance drops as the number of context updates increases.

3. Current memory systems struggle with recall and reasoning over revised facts.
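
The second finding relates accuracy to the number of context updates. A minimal sketch of how such a breakdown could be computed is shown here, assuming each graded answer is stored as a `(num_updates, is_correct)` pair. The record layout, the function name `accuracy_by_update_count`, and the toy data are all hypothetical; they are not taken from the MINTEval data format or results.

```python
from collections import defaultdict


def accuracy_by_update_count(records):
    """Return accuracy per update count.

    `records` is an iterable of (num_updates, is_correct) pairs.
    Both fields are hypothetical and used only for illustration.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for num_updates, is_correct in records:
        totals[num_updates] += 1
        correct[num_updates] += int(is_correct)
    # Sort keys so the trend across update counts is easy to read.
    return {k: correct[k] / totals[k] for k in sorted(totals)}


# Made-up records purely for illustration (not benchmark results).
toy_records = [(1, True), (1, True), (3, True), (3, False), (5, False), (5, False)]
print(accuracy_by_update_count(toy_records))
```

Grouping by update count, rather than reporting a single average, makes it visible whether errors concentrate on facts that were revised many times.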

Abstract

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3)…

Code & Models

Datasets

dinobby/MINTEval (dataset, 132 downloads)
