REMem: Reasoning with Episodic Memory in Language Agent
Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, Yu Su

TL;DR
REMem introduces a novel episodic memory framework for language agents, enabling effective recollection and reasoning over interaction histories, significantly outperforming existing memory systems on multiple benchmarks.
Contribution
It formalizes the core challenges of episodic memory in language agents and proposes a two-phase framework with hybrid memory graphs and iterative retrieval for improved reasoning.
Findings
Outperforms state-of-the-art memory systems by 3.4% and 13.4% on key tasks.
Demonstrates robust refusal behavior for unanswerable questions.
Achieves significant improvements across four episodic memory benchmarks.
Abstract
Humans excel at remembering concrete experiences along spatiotemporal contexts and performing reasoning across those events, i.e., the capacity for episodic memory. In contrast, memory in language agents remains mainly semantic, and current agents are not yet capable of effectively recollecting and reasoning over interaction histories. We identify and formalize the core challenges of episodic recollection and reasoning from this gap, and observe that existing work often overlooks episodicity, lacks explicit event modeling, or overemphasizes simple retrieval rather than complex reasoning. We present REMem, a two-phase framework for constructing and reasoning with episodic memory: 1) Offline indexing, where REMem converts experiences into a hybrid memory graph that flexibly links time-aware gists and facts. 2) Online inference, where REMem employs an agentic retriever with carefully…
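The two phases named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the schema (`Gist`, `Fact`, `HybridMemoryGraph`), the helper names, and the demo data are all hypothetical. It assumes gists and facts have already been extracted from the raw experience, and it swaps the agentic retriever for a plain rule-based, hop-by-hop expansion with a time filter.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Gist:
    """Short time-stamped summary of one experience (hypothetical schema)."""
    gist_id: str
    text: str
    when: date


@dataclass(frozen=True)
class Fact:
    """(subject, predicate, object) triple tied to the gist it came from."""
    subject: str
    predicate: str
    obj: str
    gist_id: str


@dataclass
class HybridMemoryGraph:
    gists: dict = field(default_factory=dict)  # gist_id -> Gist
    facts_by_entity: dict = field(default_factory=lambda: defaultdict(list))

    def index(self, gist, facts):
        """Offline phase: store a gist and link its facts to their entities."""
        self.gists[gist.gist_id] = gist
        for fact in facts:
            self.facts_by_entity[fact.subject.lower()].append(fact)
            self.facts_by_entity[fact.obj.lower()].append(fact)

    def facts_about(self, entity):
        return list(self.facts_by_entity.get(entity.lower(), []))

    def gists_between(self, start, end):
        hits = [g for g in self.gists.values() if start <= g.when <= end]
        return sorted(hits, key=lambda g: g.when)


def retrieve(graph, seed_entity, start, end, max_hops=2):
    """Online phase (toy stand-in for an agentic retriever): expand from a
    seed entity hop by hop, keeping only facts whose source gist is dated
    inside [start, end]."""
    frontier, seen, evidence = [seed_entity.lower()], set(), []
    for _ in range(max_hops):
        next_frontier = []
        for entity in frontier:
            if entity in seen:
                continue
            seen.add(entity)
            for fact in graph.facts_about(entity):
                gist = graph.gists[fact.gist_id]
                if start <= gist.when <= end and fact not in evidence:
                    evidence.append(fact)
                    next_frontier += [fact.subject.lower(), fact.obj.lower()]
        frontier = next_frontier
    return evidence


def build_demo_graph():
    """Made-up interaction history, for illustration only."""
    graph = HybridMemoryGraph()
    graph.index(Gist("g1", "Alice visited Paris", date(2024, 5, 1)),
                [Fact("Alice", "visited", "Paris", "g1")])
    graph.index(Gist("g2", "Alice met Bob in Paris", date(2024, 5, 2)),
                [Fact("Alice", "met", "Bob", "g2")])
    graph.index(Gist("g3", "Bob moved to Berlin", date(2024, 6, 10)),
                [Fact("Bob", "moved_to", "Berlin", "g3")])
    return graph
```

In this toy setup, calling `retrieve(build_demo_graph(), "Alice", date(2024, 5, 1), date(2024, 6, 30))` reaches the fact about Bob's move only on the second hop, which is the kind of multi-step, time-constrained lookup that one-shot retrieval tends to miss. Narrowing the window to May drops that fact.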
Peer Reviews
Decision · ICLR 2026 Poster
- balanced comparison with 4 datasets and 3 non-naive baselines - sufficiently detailed analysis of results
(check details in the question section) - an apparent oxymoron? - lightweight contribution (with apparent oxymoron?) - lack of statistical rigour in the analysis - doubtful reported numbers, with baselines scoring lower than publicly reported
- Grounding the events in time by design (compared to e.g., Mem0) - Better performance over HippoRAG2 and Mem0 for most experiments - Error analysis explanation in Sec. 6.4 regarding the remaining gap of performance in this experiment - Correct evaluation metrics
- Limitations have not been highlighted - No conflict detection and delete/update, compared to Mem0. This is not discussed at all in the paper - Experimental settings can be discussed, in particular the chunking / short size of each session - Modeling of events with grounding in time, but other aspects like spatial location are not systematically modeled - No data/code provided at submission time
- Presentation: Very well motivated and very well written paper (clear problem framing that separates episodic recollection from episodic reasoning and evaluates both). Figures are also nice and helpful. - Good execution: solid engineering, reasonable baselines, ablations that show both gists and facts matter, and an efficiency comparison. Refusal analysis is an added bonus.
- (main concern) The paper evaluates on relatively controlled settings where proper knowledge extraction is assumed to work. The most critical and challenging aspect of the work (how robustly the gist/fact extraction generalizes to noisy, ambiguous, real-world text) receives minimal treatment in the paper. Once you have clean, well-structured graphs, the superior performance on episodic reasoning tasks becomes somewhat predictable rather than surprising. Unfortunately, the paper spent most of it
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Speech and dialogue systems
