Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding
Yun Wang, Long Zhang, Jingren Liu, Jiaqi Yan, Zhanjie Zhang, Jiahao Zheng, Ao Ma, Run Ling, Xun Yang, Dapeng Wu, Xiangyu Chen, Xuelong Li

TL;DR
Video-EM introduces an event-centric episodic memory framework that enhances long-form video understanding by constructing and refining a compact, grounded event timeline for improved reasoning and question answering.
Contribution
It presents a training-free approach in which an LLM acts as an active memory agent: it localizes, groups, and encodes events with explicit spatio-temporal cues, improving long-form VideoQA.
Findings
Creates a reliable event timeline for long videos
Reduces redundancy and noise in episodic memory
Enhances VideoQA performance with a minimal memory set
Abstract
Video Large Language Models (Video-LLMs) have shown strong video understanding, yet their application to long-form videos remains constrained by limited context windows. A common workaround is to compress long videos into a handful of representative frames via retrieval or summarization. However, most existing pipelines score frames in isolation, implicitly assuming that frame-level saliency is sufficient for downstream reasoning. This often yields redundant selections, fragmented temporal evidence, and weakened narrative grounding for long-form video question answering. We present Video-EM, a training-free, event-centric episodic memory framework that reframes long-form VideoQA as episodic event construction followed by memory refinement. Instead of treating retrieved keyframes as independent visuals, Video-EM employs an LLM as an active memory agent to…
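The abstract contrasts scoring frames in isolation with building events first and refining them afterwards. The following is a minimal sketch of that two-stage idea, not the paper's method: Video-EM uses an LLM as the memory agent, whereas this sketch substitutes a simple temporal-gap heuristic and a top-k filter. The names `Event`, `group_into_events`, `refine_memory`, `max_gap`, and `budget`, as well as the sample timestamps and scores, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    start: float  # first member frame, in seconds
    end: float    # last member frame, in seconds
    score: float  # best relevance score among member frames


def group_into_events(frames: List[Tuple[float, float]], max_gap: float) -> List[Event]:
    """Merge (timestamp, score) frames into one event while consecutive
    timestamps are at most max_gap seconds apart (hypothetical heuristic)."""
    events: List[Event] = []
    for t, s in sorted(frames):
        if events and t - events[-1].end <= max_gap:
            last = events[-1]
            last.end = t
            last.score = max(last.score, s)
        else:
            events.append(Event(start=t, end=t, score=s))
    return events


def refine_memory(events: List[Event], budget: int) -> List[Event]:
    """Keep the `budget` highest-scoring events, returned in chronological order."""
    kept = sorted(events, key=lambda e: e.score, reverse=True)[:budget]
    return sorted(kept, key=lambda e: e.start)


# Made-up retrieved keyframes: (timestamp in seconds, relevance score).
frames = [(12.0, 0.9), (13.0, 0.8), (14.5, 0.7), (95.0, 0.4), (300.0, 0.85), (301.0, 0.6)]
events = group_into_events(frames, max_gap=2.0)
timeline = refine_memory(events, budget=2)
for e in timeline:
    print(f"{e.start:.1f}s-{e.end:.1f}s score={e.score}")
```

Grouping before filtering means the three near-duplicate frames around 12 to 14.5 seconds occupy a single memory slot instead of three, which is the kind of redundancy the abstract attributes to frame-level selection.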
