Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding

Yun Wang; Long Zhang; Jingren Liu; Jiaqi Yan; Zhanjie Zhang; Jiahao Zheng; Ao Ma; Run Ling; Xun Yang; Dapeng Wu; Xiangyu Chen; Xuelong Li

arXiv:2508.09486·cs.CV·March 10, 2026

TL;DR

Video-EM introduces an event-centric episodic memory framework that enhances long-form video understanding by constructing and refining a compact, grounded event timeline for improved reasoning and question answering.

Contribution

It presents a training-free approach in which an LLM acts as an active memory agent that localizes, groups, and encodes events with explicit spatio-temporal cues, improving long-form VideoQA.
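To make the "localize, group, encode" idea concrete, here is a minimal sketch of one way retrieved keyframes could be grouped into events instead of being treated as isolated frames. It is not the paper's implementation: the `Event` class, the `group_into_events` and `encode_event` helpers, and the `max_gap` threshold are all hypothetical names introduced for illustration, and simple temporal proximity stands in for whatever grouping criterion Video-EM actually uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """Hypothetical container for a temporally contiguous group of keyframes."""
    start: float          # timestamp of the first keyframe, in seconds
    end: float            # timestamp of the last keyframe, in seconds
    frames: List[float]   # timestamps of all keyframes in the event


def group_into_events(timestamps: List[float], max_gap: float = 2.0) -> List[Event]:
    """Merge keyframe timestamps into events.

    A keyframe joins the current event when it is at most `max_gap` seconds
    after that event's last keyframe; otherwise it starts a new event.
    (`max_gap` is an assumed heuristic, not a value from the paper.)
    """
    events: List[Event] = []
    for t in sorted(timestamps):
        if events and t - events[-1].end <= max_gap:
            events[-1].end = t
            events[-1].frames.append(t)
        else:
            events.append(Event(start=t, end=t, frames=[t]))
    return events


def encode_event(event: Event) -> str:
    """Render an event with an explicit temporal cue for use in an LLM prompt."""
    return f"[{event.start:.1f}s-{event.end:.1f}s] {len(event.frames)} keyframe(s)"


if __name__ == "__main__":
    # Timestamps of retrieved keyframes, in arbitrary order.
    retrieved = [10.0, 11.0, 12.5, 40.0, 3.0]
    for ev in group_into_events(retrieved):
        print(encode_event(ev))
```

Grouping first and encoding second means near-duplicate frames from the same moment collapse into a single timeline entry, which is the kind of redundancy reduction the findings below describe.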

Findings

01

Creates a reliable event timeline for long videos

02

Reduces redundancy and noise in episodic memory

03

Enhances VideoQA performance with a minimal memory set

Abstract

Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to…
