APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval
Hong Gao, Yiming Bao, Xuezhen Tu, Bin Zhong, Linan Yue, Minling Zhang

TL;DR
APVR is a training-free framework that hierarchically retrieves and retains important visual information from hour-long videos, enabling large language models to better understand long videos without heavy resource demands.
Contribution
The paper introduces APVR, a novel hierarchical retrieval method that overcomes memory and resource limitations for long video understanding in multimodal large language models.
Findings
Achieves up to 9.7% performance improvement on LongVideoBench.
Reports state-of-the-art results compared with both training-free and training-based methods.
Effectively processes hour-long videos with maintained semantic fidelity.
Abstract
Current multimodal large language models (MLLMs) struggle with hour-level video understanding, facing significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits the performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breaks through the memory wall limitation via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames,…
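The abstract is truncated here, so the paper's actual scoring procedure is not shown. As a rough illustration of query-driven frame retrieval in general, the sketch below scores each frame embedding against a set of expanded query embeddings and keeps the top-k frames in temporal order. The function names, the cosine-similarity score, and the max-over-queries rule are all assumptions made for illustration; this is not APVR's iterative spatio-semantic confidence scoring.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_pivot_frames(frame_embs, query_embs, k):
    """Hypothetical helper: pick the k frames most similar to any expanded query.

    frame_embs: one embedding per sampled frame, in temporal order.
    query_embs: embeddings of the original query and its expansions (assumed given).
    Returns (selected frame indices in temporal order, per-frame scores).
    """
    # Assumption: a frame's confidence is its best match over all expanded queries.
    scores = [max(cosine(f, q) for q in query_embs) for f in frame_embs]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top), scores


# Toy 2-D embeddings standing in for real encoder outputs.
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
queries = [[1.0, 0.0]]
selected, scores = retrieve_pivot_frames(frames, queries, k=2)
```

In a real pipeline the embeddings would come from a vision-language encoder, and the scoring would likely be repeated at progressively finer temporal granularity. Both of those steps are outside what this sketch covers.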
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
