GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory

Jeong Hun Yeo; Sangyun Chung; Sungjune Park; Dae Hoe Kim; Jinyoung Moon; Yong Man Ro

arXiv:2511.12027 · cs.CV · November 18, 2025

TL;DR

GCAgent introduces a structured episodic memory framework that significantly improves long-video understanding in multimodal large language models by modeling event relations and maintaining global context.

Contribution

The paper presents GCAgent, a framework built around a Schematic and Narrative Episodic Memory that supports reasoning over long videos and outperforms comparable 7B-scale MLLMs.

Findings

1. Achieves up to a 23.5% accuracy improvement on the Video-MME Long split.
2. Establishes state-of-the-art performance among 7B-scale MLLMs.
3. Validates the effectiveness of structured memory for long-video reasoning.

Abstract

Long-video understanding remains a significant challenge for Multimodal Large Language Models (MLLMs) due to inherent token limitations and the complexity of capturing long-term temporal dependencies. Existing methods often fail to capture the global context and complex event relationships necessary for deep video reasoning. To address this, we introduce GCAgent, a novel Global-Context-Aware Agent framework that achieves comprehensive long-video understanding. Our core innovation is the Schematic and Narrative Episodic Memory. This memory structurally models events and their causal and temporal relations into a concise, organized context, fundamentally resolving the long-term dependency problem. Operating in a multi-stage Perception-Action-Reflection cycle, our GCAgent utilizes a Memory Manager to retrieve relevant episodic context for robust, context-aware inference. Extensive…
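
The abstract describes a memory that stores events together with their causal and temporal relations, and a Memory Manager that retrieves the relevant part of it for a question. The sketch below is only a toy illustration of that idea and is not the authors' implementation: the class names (Event, EpisodicMemory), the relation labels ("before", "causes"), the word-overlap scoring, and the sample events are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    """One stored episode: a time span in seconds plus a text description."""
    event_id: int
    start_s: float
    end_s: float
    description: str


@dataclass
class EpisodicMemory:
    """Hypothetical event store with labeled relations between events."""
    events: Dict[int, Event] = field(default_factory=dict)
    # Each relation is (source event id, label, destination event id).
    relations: List[Tuple[int, str, int]] = field(default_factory=list)

    def add_event(self, event: Event) -> None:
        self.events[event.event_id] = event

    def add_relation(self, src_id: int, relation: str, dst_id: int) -> None:
        if src_id not in self.events or dst_id not in self.events:
            raise KeyError("both events must be stored before relating them")
        self.relations.append((src_id, relation, dst_id))

    def retrieve(self, query: str, top_k: int = 1) -> List[Event]:
        """Pick the top_k events by word overlap with the query, then add
        every event directly related to one of them (one hop only)."""
        query_words = set(query.lower().split())

        def overlap(event: Event) -> int:
            return len(query_words & set(event.description.lower().split()))

        ranked = sorted(self.events.values(),
                        key=lambda e: (-overlap(e), e.start_s))
        seeds = {e.event_id for e in ranked[:top_k]}
        selected = set(seeds)
        for src_id, _, dst_id in self.relations:
            if src_id in seeds or dst_id in seeds:
                selected.update((src_id, dst_id))
        return sorted((self.events[i] for i in selected),
                      key=lambda e: e.start_s)

    def render_context(self, events: List[Event]) -> str:
        """Format the selected events and the relations among them as text."""
        ids = {e.event_id for e in events}
        lines = [f"[{e.start_s:.0f}-{e.end_s:.0f}s] {e.description}"
                 for e in events]
        lines += [f"event {s} {label} event {d}"
                  for s, label, d in self.relations if s in ids and d in ids]
        return "\n".join(lines)


def build_demo_memory() -> EpisodicMemory:
    """Made-up sample data, used only to exercise the sketch."""
    memory = EpisodicMemory()
    memory.add_event(Event(1, 0, 30, "a man chops vegetables"))
    memory.add_event(Event(2, 30, 90, "the man fries vegetables in a pan"))
    memory.add_event(Event(3, 90, 120, "a dog runs in the garden"))
    memory.add_relation(1, "before", 2)
    memory.add_relation(1, "causes", 2)
    return memory
```

Calling build_demo_memory().retrieve("why does the man fry vegetables") selects the frying event and pulls in the chopping event linked to it, so the text passed to a model stays short while still carrying the related earlier event. A real system would presumably replace the word-overlap scoring with a learned retriever.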

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling