LMEB: Long-horizon Memory Embedding Benchmark
Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang

TL;DR
The paper introduces LMEB, a benchmark of 22 datasets and 193 zero-shot retrieval tasks for evaluating embedding models on long-horizon memory retrieval, and highlights the challenges and current limitations of models in complex memory retrieval scenarios.
Contribution
The paper presents LMEB, the first extensive benchmark for evaluating long-horizon memory embeddings across four memory types (episodic, dialogue, semantic, and procedural), filling a gap left by text embedding benchmarks that focus on traditional passage retrieval.
Findings
Larger models do not always outperform smaller ones.
LMEB and MTEB measure different capabilities.
Traditional passage retrieval performance does not guarantee success in long-horizon memory retrieval.
Abstract
Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models' ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this gap, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework for evaluating embedding models on complex, long-horizon memory retrieval. LMEB comprises 22 datasets and 193 zero-shot retrieval tasks spanning four memory types: episodic, dialogue, semantic, and procedural. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely…
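As a rough illustration of what a zero-shot memory retrieval task involves, the sketch below ranks stored memory entries by cosine similarity to a query embedding and scores the ranking with Recall@k and NDCG@k. It is a minimal sketch under stated assumptions, not LMEB's evaluation code: the excerpt above does not name the benchmark's metrics, and the function names, memory ids, and 2-dimensional toy vectors are all hypothetical stand-ins for real model embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_memories(query_vec, memory_vecs):
    """Return memory ids sorted by descending similarity to the query (ties broken by id)."""
    scored = [(cosine(query_vec, vec), mem_id) for mem_id, vec in memory_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [mem_id for _, mem_id in scored]


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant memories that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for mem_id in ranked_ids[:k] if mem_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, mem_id in enumerate(ranked_ids[:k])
        if mem_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy embeddings; a real run would use an embedding model's output.
memories = {
    "m1": [1.0, 0.0],
    "m2": [0.0, 1.0],
    "m3": [1.0, 1.0],
}
query = [1.0, 0.1]

ranked = rank_memories(query, memories)
print(ranked)
print(recall_at_k(ranked, {"m1"}, k=1))
print(ndcg_at_k(ranked, {"m3"}, k=2))
```

In a zero-shot setting the embedding model is used as-is, with no fine-tuning on the task, so the only task-specific inputs are the queries, the memory entries, and the relevance labels.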
