LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

Zihao Cheng; Weixin Wang; Yu Zhao; Ziyang Ren; Jiaxuan Chen; Ruiyang Xu; Shuai Huang; Yang Chen; Guowei Li; Mengshi Wang; Yi Xie; Ren Zhu; Zeren Jiang; Keda Lu; Yihong Li; Xiaoliang Wang; Liwei Liu; Cam-Tu Nguyen

arXiv:2603.03781·cs.AI·March 5, 2026

TL;DR

LifeBench is a new benchmark designed to evaluate AI agents' ability to perform long-horizon, multi-source memory reasoning that includes both declarative and non-declarative types, using diverse, real-world inspired data.

Contribution

This paper introduces LifeBench, a benchmark of long-term, multi-source memory tasks built on real-world-inspired data, addressing the limitations of existing memory benchmarks.

Findings

01. State-of-the-art systems achieve only 55.2% accuracy on LifeBench.

02. LifeBench effectively challenges memory systems with long-horizon, multi-source reasoning tasks.

03. The dataset emphasizes data quality, diversity, and scalability, inspired by cognitive science principles.

Abstract

Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, which must be inferred from diverse digital traces. To bridge this gap, we introduce LifeBench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by…

Taxonomy

Topics: Multimodal Machine Learning Applications · Machine Learning in Healthcare · Topic Modeling