LightMem: Lightweight and Efficient Memory-Augmented Generation
Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

TL;DR
LightMem is a novel memory system for LLMs that balances performance and efficiency by mimicking human memory stages, significantly improving accuracy and reducing computational costs in dynamic environments.
Contribution
Introduces LightMem, a memory architecture inspired by human memory stages, that enhances LLM performance while reducing computational overhead.
Findings
Improves QA accuracy by up to 7.7% / 29.3%.
Reduces token usage by up to 38x / 20.9x.
Cuts API calls by up to 55.5x / 310x.
Abstract
Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to its topics. Next, topic-aware short-term memory consolidates these…
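The first stage described in the abstract (filter irrelevant content, then group what remains by topic) can be illustrated with a toy sketch. This is not the authors' implementation: the filler-word list, the keyword-based topic table, and the function names `compress`, `assign_topic`, and `sensory_memory` are all hypothetical stand-ins. A real system would presumably use a learned compressor and an embedding or topic model instead.

```python
from collections import defaultdict

# Hypothetical filler vocabulary (assumption); stands in for a learned compressor.
FILLER = {"um", "uh", "like", "well", "so", "hello", "hi", "thanks", "okay", "ok"}

# Hypothetical topic keywords (assumption); stands in for an embedding or topic model.
TOPIC_KEYWORDS = {
    "travel": {"flight", "hotel", "trip"},
    "food": {"dinner", "pizza", "restaurant"},
}


def _normalize(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation."""
    return token.lower().strip(".,!?")


def compress(message: str) -> str:
    """Drop filler tokens; a toy stand-in for lightweight compression."""
    kept = [tok for tok in message.split() if _normalize(tok) not in FILLER]
    return " ".join(kept)


def assign_topic(message: str) -> str:
    """Return the first topic whose keywords overlap the message, else 'other'."""
    tokens = {_normalize(tok) for tok in message.split()}
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return topic
    return "other"


def sensory_memory(messages):
    """Compress each message, discard empty results, and group the rest by topic."""
    groups = defaultdict(list)
    for msg in messages:
        compressed = compress(msg)
        if compressed:  # messages that were pure filler are filtered out
            groups[assign_topic(compressed)].append(compressed)
    return dict(groups)


if __name__ == "__main__":
    history = [
        "Um, I booked a flight to Rome.",
        "Okay thanks!",
        "Well, dinner was pizza.",
        "The hotel is near the station.",
    ]
    for topic, items in sensory_memory(history).items():
        print(topic, items)
```

In this sketch the later stages would receive topic-grouped, shortened text rather than the raw history, which is one plausible way that the token counts passed to downstream model calls could shrink.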
Peer Reviews
Decision: ICLR 2026 Poster
The architecture is inspired by cognitive science, and the translation of the human memory model into an LLM framework is novel and well-motivated. The model demonstrates promising efficiency compared to other memory-augmented systems, while achieving superior performance on the evaluated benchmark.
The paper claims significant efficiency gains, but this is not well supported. LightMem introduces at least three additional components (a compression model, an embedding or topic model, and a summarization model), making it unclear how overall runtime and token usage could be lower than baselines such as RAG. Were the costs of “sleep-time” updates included? The paper should provide standard efficiency metrics such as FLOPs or throughput, or a theoretical analysis explaining why and by how much
* I think this is a decent engineering effort to build memory-augmented LLMs.
* I am not sure about the novelty of this work. The base idea seems to have been studied in prior work under different names and flavors.
By and large, the paper is well written and well motivated. The stages of creating and generating memory entries are inspired by the model of the human brain. The experimental setup is sufficient and results are convincing. The field of memory-augmented LLMs is crowded. While the paper does not present fundamentally different ideas to the ones in previous work, I still find it interesting and I think that it is a solid addition to this growing body of work.
I don’t have any major issues with the paper. The description in Section 3.3 is not clear to me. What happens after a queue for each entry is created? How is f_{update} implemented? What happens when the LTM reaches capacity?
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications
