RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems
Mingcong Lei, Honghao Cai, Yuyuan Yang, Yimou Wu, Jinke Ren, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Shaohan Jiang, Ge Wang, Junyuan Tan, Zhenglin Wan, Zheng Li, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han

TL;DR
RoboMemory is a brain-inspired framework that integrates multiple memory types to enhance long-term, interactive learning and reasoning in embodied robots, demonstrating significant performance improvements in real-world tasks.
Contribution
The paper introduces RoboMemory, a multi-memory agentic framework that unifies Spatial, Temporal, Episodic, and Semantic memory with a dynamic spatial knowledge graph and a closed-loop planner with a critic module for improved robotic learning.
Findings
Improves the average success rate on EmbodiedBench by 26.5% over its baseline
Outperforms state-of-the-art models such as Claude-3.5-Sonnet
Demonstrates effective cumulative learning in real-world trials
Abstract
Embodied intelligence aims to enable robots to learn, reason, and generalize robustly across complex real-world environments. However, existing approaches often struggle with partial observability, fragmented spatial reasoning, and inefficient integration of heterogeneous memories, limiting their capacity for long-horizon adaptation. To address this, we introduce RoboMemory, a brain-inspired framework that unifies Spatial, Temporal, Episodic, and Semantic memory within a parallelized architecture for efficient long-horizon planning and interactive learning. Its core innovations are a dynamic spatial knowledge graph for scalable, consistent memory updates and a closed-loop planner with a critic module for adaptive decision-making. Extensive experiments on EmbodiedBench show that RoboMemory, instantiated with Qwen2.5-VL-72B-Ins, improves the average success rate by 26.5% over its strong…
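The two mechanisms named in the abstract, a spatial knowledge graph that stays consistent under updates and a closed-loop planner gated by a critic, can be illustrated with a minimal sketch. This is not the authors' implementation: `SpatialGraph`, `Memory`, `run_episode`, and the `plan`, `critic`, and `execute` callables are all hypothetical names chosen for illustration, the loop is sequential rather than parallelized, and in the paper these roles are filled by model-based modules rather than plain functions.

```python
from dataclasses import dataclass, field


@dataclass
class SpatialGraph:
    """Toy spatial knowledge graph (hypothetical): (subject, relation) -> object."""
    edges: dict = field(default_factory=dict)

    def update(self, subject, relation, obj):
        # Overwriting the old edge keeps the graph consistent when an object moves.
        self.edges[(subject, relation)] = obj

    def query(self, subject, relation):
        return self.edges.get((subject, relation))


@dataclass
class Memory:
    """Container for the four memory types named in the abstract (assumed layout)."""
    spatial: SpatialGraph = field(default_factory=SpatialGraph)
    temporal: list = field(default_factory=list)  # log of recent steps
    episodic: list = field(default_factory=list)  # outcomes of past tasks
    semantic: dict = field(default_factory=dict)  # general learned facts


def run_episode(goal, plan, critic, execute, memory, max_steps=10):
    """Closed loop: plan -> critic check -> execute -> memory update."""
    for step in range(max_steps):
        action = plan(goal, memory)
        if not critic(action, memory):
            # A rejected action is logged so the planner can propose another one.
            memory.temporal.append((step, action, "rejected"))
            continue
        observation = execute(action)
        memory.temporal.append((step, action, "executed"))
        for subject, relation, obj in observation.get("facts", []):
            memory.spatial.update(subject, relation, obj)
        if observation.get("done"):
            memory.episodic.append((goal, "success", step + 1))
            return True
    memory.episodic.append((goal, "failure", max_steps))
    return False
```

Keying edges by `(subject, relation)` is one simple way to make updates idempotent: a new observation about where an object is replaces the stale one instead of accumulating contradictory facts.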
Peer Reviews
Decision · Submitted to ICLR 2026
* The paper tackles an important problem: improving long-horizon reasoning in complex embodied tasks.
* The paper reports results in multiple environments, including the real world.
1) I have several doubts about the experimental methodology in this paper. If I understand correctly, all the VLM baselines in the paper (or at least the open-source ones) have zero historical awareness since they only take one frame as input. It is trivial and necessary to compare against a fairer baseline that takes multiple frames across the history as input to give it the required context, since current-day VLMs support this easily. Is my understanding correct that this is not curren
1. The paper is well-written and easy to read.
2. It explores an important direction in developing embodied agents with memory mechanisms.
3. Experiments are conducted on a real robot.
4. The method demonstrates improvements on the tasks considered.
1. The **Related Works** section (at least in a shortened version) should be included in the main text. Its purpose is to position the work relative to existing approaches and highlight its novelty and relevance.
2. There is no comparison of the proposed method with other approaches on a real robot.
3. Experiments on tasks that truly require memory, beyond spatial memory, are missing. The authors should at least propose a small set of test tasks and demonstrate comparisons on them with a detaile
1. The design of the approach is very careful and well-inspired. The idea of borrowing knowledge from how the human brain works is interesting, and the approach shows how to model this using MLLMs.
2. The authors test against various open-source and closed-source VLMs, as well as a SOTA VLM-agent framework, ensuring a comprehensive evaluation.
3. The authors also show that their approach is effective for interactive learning in the real world, which is a useful experiment.
1. While the design is careful, it may be possible to simplify the approach by combining various aspects of the different memories created.
2. The authors do not discuss the hardware, cost, and compute required for their approach compared with the baseline methods. This method may achieve success, but because of the several VLMs and VLAs involved, it may consume a lot of resources.
3. The real-world evaluation is done at a small scale, with only 15 tasks, and the success rates are still pretty low (
Taxonomy
Topics: Reinforcement Learning in Robotics · Modular Robots and Swarm Intelligence · Multi-Agent Systems and Negotiation
