AdaFocus: Adaptive Relevance-Diversity Sampling with Zero-Cache Look-back for Efficient Long Video Understanding
Xiao Yang, Yingzhe Ma, Haoxuan Yu, Zixin Li, Ning Qin

TL;DR
AdaFocus introduces a progressive evidence acquisition framework for long video understanding, combining adaptive sampling and zero-cache disk retrieval to improve efficiency and accuracy.
Contribution
It proposes a novel adaptive relevance-diversity sampler and an on-demand evidence refinement mechanism, enabling scalable long-video reasoning without exhaustive preloading.
Findings
Improves accuracy by +2.59 on VideoMME
Improves mIoU by +8.39 on Charades-STA
Reduces visual token consumption by ~33x
Abstract
Long video understanding is heavily bottlenecked by a rigid one-shot paradigm: existing methods either densely encode videos at prohibitive memory and latency costs, or aggressively compress them into sparse frame sets that irreversibly discard fine-grained evidence needed for downstream reasoning. Consequently, current models struggle to simultaneously balance temporal coverage, visual details, and computational efficiency. We propose AdaFocus, an efficient framework that rethinks long-video understanding as progressive evidence acquisition rather than one-pass encoding. AdaFocus relies on two tightly coupled components. First, a Query-Aware Adaptive Relevance-Diversity sampler (AdaRD) produces a compact yet informative video preview, adaptively switching to global clustering when the query lacks reliable local grounding. Second, instead of caching exhaustive frame sequences in…
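To make the relevance-diversity idea concrete, here is a minimal sketch of a query-aware frame selector with a clustering fallback. It is not the authors' AdaRD implementation. It assumes precomputed, non-zero frame and query embeddings, stands in a greedy maximal-marginal-relevance rule for the relevance-diversity trade-off, and uses a small k-means pass for the "global clustering" branch. The function names, the `lam` weight, and the `min_relevance` threshold that triggers the fallback are all hypothetical.

```python
import numpy as np


def _cluster_representatives(f, k, iters=10):
    """Query-agnostic fallback: small k-means, one representative per cluster."""
    # Deterministic init (an assumption): evenly spaced frames as centroids.
    centers = f[np.linspace(0, len(f) - 1, k).astype(int)].copy()
    for _ in range(iters):
        dist = np.linalg.norm(f[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = f[assign == c].mean(axis=0)
    dist = np.linalg.norm(f[:, None, :] - centers[None, :, :], axis=2)
    assign = dist.argmin(axis=1)
    reps = []
    for c in range(k):
        members = np.flatnonzero(assign == c)
        if members.size:  # empty clusters are skipped, so fewer than k may return
            reps.append(int(members[dist[members, c].argmin()]))
    return sorted(reps)


def select_preview_frames(frame_emb, query_emb, k, lam=0.5, min_relevance=0.2):
    """Return sorted indices of up to k preview frames.

    lam and min_relevance are hypothetical knobs, not values from the paper.
    """
    # Normalise so dot products are cosine similarities.
    f = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    relevance = f @ q
    k = min(k, len(f))

    # Assumed grounding test: no frame is similar enough to the query.
    if relevance.max() < min_relevance:
        return _cluster_representatives(f, k)

    # Greedy selection: reward query relevance, penalise similarity
    # to frames that were already picked.
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        redundancy = (f @ f[selected].T).max(axis=1)
        score = lam * relevance - (1 - lam) * redundancy
        score[selected] = -np.inf
        selected.append(int(np.argmax(score)))
    return sorted(selected)
```

With `lam` close to 1 the selector behaves like plain top-k relevance ranking, and with `lam` close to 0 it spreads picks across dissimilar frames. The fallback ignores the query entirely, which mirrors the abstract's description only loosely.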
