APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Hong Gao; Yiming Bao; Xuezhen Tu; Bin Zhong; Linan Yue; Minling Zhang

arXiv:2506.04953·cs.CV·November 18, 2025

Open Access

TL;DR

APVR is a training-free framework that hierarchically retrieves and retains important visual information from hour-long videos, enabling multimodal large language models to better understand long videos without heavy resource demands.

Contribution

The paper introduces APVR, a novel hierarchical retrieval method that overcomes memory and resource limitations for long video understanding in multimodal large language models.

Findings

01. Achieves up to 9.7% performance improvement on LongVideoBench.

02. Demonstrates state-of-the-art results for training-free and training-based methods.

03. Effectively processes hour-long videos with maintained semantic fidelity.

Abstract

Current multimodal large language models (MLLMs) struggle with hour-level video understanding, facing significant challenges not only in modeling the substantial information volume of long videos but also in overcoming the memory wall and resource constraints during both training and inference. Although recent training-free approaches have alleviated resource demands by compressing visual features, their reliance on incomplete visual information limits their performance potential. To address these limitations, we propose Adaptive Pivot Visual information Retrieval (APVR), a training-free framework that hierarchically retrieves and retains sufficient and important visual information. It breaks through the memory wall via two complementary components: Pivot Frame Retrieval employs query expansion and iterative spatio-semantic confidence scoring to identify relevant video frames,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning