Enhancing Long Video Understanding via Hierarchical Event-Based Memory
Dingxin Cheng, Mingda Li, Jingyu Liu, Yongxin Guo, Bin Jiang, Qingbin Liu, Xi Chen, Bo Zhao

TL;DR
This paper introduces HEM-LLM, a hierarchical event-based memory model that segments long videos into events, reducing information redundancy and strengthening the modeling of long-term inter-event dependencies. The authors report state-of-the-art results on multiple video understanding tasks.
Contribution
The paper proposes a novel adaptive segmentation scheme and hierarchical memory modeling for long videos, addressing redundancy and dependency issues in video understanding.
Findings
Achieves state-of-the-art performance on multiple video understanding tasks.
Effectively reduces information redundancy in long videos.
Enhances long-term inter-event dependency modeling.
Abstract
Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most of the existing models compress diverse semantic information within the whole video and feed it into LLMs for content comprehension. While this method excels in short video understanding, it may result in a blend of multiple event information in long videos due to coarse compression, which causes information redundancy. Consequently, the semantics of key events might be obscured within the vast information that hinders the model's understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. Firstly, we design a novel adaptive sequence segmentation scheme to divide multiple events within long videos. In this way, we can perform…
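The abstract does not spell out how the adaptive segmentation works, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes per-frame feature vectors are already available, measures cosine similarity between adjacent frames, and places an event boundary wherever similarity drops below a video-specific threshold (mean minus z standard deviations). The function name segment_events, the parameter z, and the threshold rule are all hypothetical choices made for illustration.

```python
import numpy as np


def segment_events(frame_feats, z=1.0):
    """Split a frame sequence into contiguous (start, end) segments.

    Illustrative assumption, not the paper's scheme: a boundary is placed
    between frames i and i+1 when their cosine similarity falls below
    mean(sims) - z * std(sims), so the threshold adapts to each video.
    """
    feats = np.asarray(frame_feats, dtype=float)
    n = len(feats)
    if n < 2:
        return [(0, n)]

    # L2-normalise each frame feature (epsilon guards against zero vectors).
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    normed = feats / (norms + 1e-8)

    # Cosine similarity between each frame and the next one.
    sims = np.sum(normed[:-1] * normed[1:], axis=1)

    # A video with no variation in adjacent similarity is one event.
    if sims.std() < 1e-8:
        return [(0, n)]

    threshold = sims.mean() - z * sims.std()
    cuts = [i + 1 for i, s in enumerate(sims) if s < threshold]

    # Turn cut positions into half-open (start, end) index ranges.
    bounds = [0] + cuts + [n]
    return list(zip(bounds[:-1], bounds[1:]))
```

For example, eight frames whose features switch abruptly halfway through are split into two half-open index ranges covering frames 0 to 3 and frames 4 to 7. Because the threshold is derived from each video's own similarity statistics, the number of events is not fixed in advance, which is the sense in which this sketch is "adaptive".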
Taxonomy
Topics: Advanced Data Compression Techniques · Video Analysis and Summarization · Visual Attention and Saliency Detection
