Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory
Lin Long, Yichen He, Wentao Ye, Yiyuan Pan, Yuan Lin, Hang Li, Junbo Zhao, Wei Li

TL;DR
The paper presents M3-Agent, a multimodal agent with long-term memory that supports multi-turn reasoning and knowledge accumulation. It is evaluated on a new long-video question answering benchmark, where it outperforms existing baselines.
Contribution
Introduction of M3-Agent with entity-centric, multimodal long-term memory and a new benchmark, M3-Bench, for evaluating memory-based reasoning in multimodal agents.
Findings
M3-Agent outperforms baseline models on M3-Bench and VideoMME-long datasets.
The agent effectively integrates visual and auditory inputs for reasoning.
Reinforcement learning enhances the agent's performance over prompting methods.
Abstract
We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update episodic and semantic memories, gradually accumulating world knowledge. Its memory is organized in an entity-centric, multimodal manner, enabling deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn reasoning and retrieves relevant memories to complete tasks. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a long-video question answering benchmark comprising 100 newly recorded robot-perspective videos (M3-Bench-robot) and 920 diverse web-sourced videos (M3-Bench-web). We annotate QA pairs designed to test capabilities essential for agent applications, such as person understanding,…
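To make the idea of entity-centric memory concrete, here is a minimal sketch in Python. It is not the authors' implementation: the class names `MemoryItem` and `EntityMemory`, the entity IDs such as `person_1`, and the word-overlap scoring are all assumptions chosen for illustration. The abstract does not specify how memories are stored or retrieved, and a real system would presumably retrieve over multimodal embeddings rather than word overlap.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryItem:
    """One stored memory (hypothetical structure, not from the paper)."""
    kind: str              # "episodic" (what happened) or "semantic" (general knowledge)
    text: str              # textual description of the memory
    entity_ids: List[str]  # entities (e.g. people) this memory is linked to


@dataclass
class EntityMemory:
    """Toy entity-centric store: every item is indexed by the entities it mentions."""
    items: List[MemoryItem] = field(default_factory=list)
    by_entity: Dict[str, List[int]] = field(default_factory=dict)

    def add(self, kind: str, text: str, entity_ids: List[str]) -> int:
        if kind not in ("episodic", "semantic"):
            raise ValueError(f"unknown memory kind: {kind}")
        index = len(self.items)
        self.items.append(MemoryItem(kind, text, list(entity_ids)))
        for entity_id in entity_ids:
            self.by_entity.setdefault(entity_id, []).append(index)
        return index

    def search(self, query: str, entity_id: Optional[str] = None,
               top_k: int = 3) -> List[MemoryItem]:
        """Rank items by word overlap with the query (a stand-in for embedding search)."""
        if entity_id is None:
            candidates = range(len(self.items))
        else:
            candidates = self.by_entity.get(entity_id, [])
        query_words = set(re.findall(r"\w+", query.lower()))
        scored = []
        for index in candidates:
            item_words = set(re.findall(r"\w+", self.items[index].text.lower()))
            overlap = len(query_words & item_words)
            if overlap > 0:
                scored.append((overlap, index))
        # Highest overlap first; ties keep insertion order.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [self.items[index] for _, index in scored[:top_k]]


if __name__ == "__main__":
    memory = EntityMemory()
    memory.add("episodic", "Alice picks up the red mug in the kitchen", ["person_1"])
    memory.add("semantic", "Alice prefers coffee in the morning", ["person_1"])
    memory.add("episodic", "Bob waters the plants", ["person_2"])
    for hit in memory.search("What does Alice prefer to drink, coffee?", entity_id="person_1"):
        print(hit.kind, "|", hit.text)
```

Indexing by entity is what lets a query about one person draw on both what that person did (episodic) and what is known about them in general (semantic), which is the kind of consistency the abstract attributes to the entity-centric organization.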
Peer Reviews
Decision: ICLR 2026 Poster
The paper provides both a practical recipe for a long-term memory and a dataset and benchmark. In my opinion, it is an excellent contribution. The paper is well-written and easy to follow, despite the large amount of work described. The memory agent is interesting and flexible; it involves tool use, e.g. to do facial recognition and associate memories with the speaker. The proposed agent serves as an excellent baseline on the proposed benchmark. The fact that DAPO on th
Overall I don't see major weaknesses, only some minor ones about the memory (which I see as a baseline on the new benchmark).
* The control policy is trained using RL, but the memory uses only SFT. Though it may be tricky to implement from an engineering perspective, RL seems especially useful for the memory policy, too.
* Due to the method of constructing entities in the memory dictionary, it seems speaker faces must be visible
1. M3-Bench is a novel evaluation benchmark designed to assess memory-based reasoning beyond low-level perception. The dataset comprises 100 robot-egocentric videos and 920 web videos, offering a good option for agent evaluation. The data processing and evaluation reflect the authors' significant efforts.
2. This paper provides a detailed account of the development process and implementation specifics for constructing M3-Bench and M3-Agent, supported by comprehensive experiments. Furthermore,
1. Automatic grading uses GPT-4o to judge correctness. Although the paper reports 96% agreement with the human majority on a small set, relying on a single proprietary LLM as an oracle can introduce bias and inflate or depress certain methods’ scores.
2. GPT-4o was employed for data generation, evaluation, and baseline comparison. On one hand, this diminishes the paper's contribution, making it more akin to prompt engineering using GPT-4o. On the other hand, constructing data and conducting evaluat
1. This paper proposes a challenging and realistic video understanding benchmark, M3-Bench, that goes beyond existing works. Instead of relying solely on visual perception, it further requires the agent to leverage higher-level cognitive abilities and integrate world knowledge to solve complex tasks that involve long-horizon dependencies, temporal reasoning, and cross-modal understanding. Moreover, M3-Bench fills an important gap in evaluating memory-based long-term multimodal reasoning, especial
1. Lack of real methodological novelty. The overall pipeline resembles VideoAgent [1]: the agent first perceives multimodal inputs, then stores textualized or embedding-based observations in an external memory, performs RAG at query time, and finally conducts multi-step reasoning over the retrieved memory to produce answers. The entity graph is not novel; similar identity linking appears in StoryTeller [2]. The RL training also follows a conventional approach, similar to Search-R1 [3], mak
