GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory
Jeong Hun Yeo, Sangyun Chung, Sungjune Park, Dae Hoe Kim, Jinyoung Moon, Yong Man Ro

TL;DR
GCAgent introduces a structured episodic memory framework that significantly improves long-video understanding in multimodal large language models by modeling event relations and maintaining global context.
Contribution
The paper presents GCAgent, a framework built on schematic and narrative episodic memory that supports reasoning over long videos and outperforms existing 7B-scale MLLMs in accuracy.
Findings
Achieves up to a 23.5% accuracy improvement on the Video-MME Long split.
Establishes state-of-the-art performance among 7B-scale MLLMs.
Validates the effectiveness of structured memory for long-video reasoning.
Abstract
Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive…
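As a rough illustration of the idea in the abstract (events stored with causal and temporal relations, and a manager that retrieves relevant episodic context), the following is a minimal sketch, not the authors' implementation. Every name here (`Event`, `EpisodicMemory`, `relate`, `retrieve`), the two relation labels, and the word-overlap scoring are assumptions made for illustration only; the paper's actual memory schema and retrieval method are not described in this summary.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One episode: a time span plus a short description (hypothetical schema)."""
    event_id: int
    start_s: float
    end_s: float
    description: str


@dataclass
class EpisodicMemory:
    """Toy store of events and typed relations between them (illustrative only)."""
    events: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (src_id, relation, dst_id)

    def add_event(self, event):
        self.events[event.event_id] = event

    def relate(self, src_id, relation, dst_id):
        # Assumed relation vocabulary: temporal ("before") and causal ("causes").
        if relation not in ("before", "causes"):
            raise ValueError(f"unknown relation: {relation}")
        self.relations.append((src_id, relation, dst_id))

    def retrieve(self, query, top_k=1):
        """Pick the top_k events by word overlap, then add their direct neighbours."""
        query_words = set(query.lower().split())
        scored = []
        for event in self.events.values():
            overlap = len(query_words & set(event.description.lower().split()))
            if overlap > 0:
                scored.append((overlap, event.event_id))
        # Highest overlap first; ties broken by event id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        seeds = {event_id for _, event_id in scored[:top_k]}

        # Expand one hop along causal/temporal relations to recover context.
        context = set(seeds)
        for src_id, _, dst_id in self.relations:
            if src_id in seeds:
                context.add(dst_id)
            if dst_id in seeds:
                context.add(src_id)
        # Return event ids in chronological order.
        return sorted(context, key=lambda i: self.events[i].start_s)


memory = EpisodicMemory()
memory.add_event(Event(1, 0.0, 10.0, "a chef leaves the pot unattended"))
memory.add_event(Event(2, 40.0, 55.0, "the pot boils over on the stove"))
memory.add_event(Event(3, 60.0, 70.0, "a dog sleeps on the sofa"))
memory.add_event(Event(4, 90.0, 95.0, "the dog wakes up and barks"))
memory.relate(1, "causes", 2)
memory.relate(3, "before", 4)

context_ids = memory.retrieve("pot boils over", top_k=1)
```

In this toy example the query matches event 2 most strongly, and the one-hop expansion also returns event 1 because it is linked by a `causes` relation, while the unrelated events 3 and 4 are left out. A real system would presumably use learned embeddings rather than word overlap; that substitution is made here only to keep the sketch self-contained.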
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
