Iterative Zoom-In: Temporal Interval Exploration for Long Video Understanding
Chenglin Li, Qianglong Chen, fengtao, Yin Zhang

TL;DR
This paper presents Temporal Search, a training-free iterative framework that enhances long video understanding by dynamically exploring relevant temporal intervals using confidence-based refinement and a best-first search strategy.
Contribution
It introduces a novel, training-free temporal exploration method for MLLMs that improves long video comprehension by focusing on task-relevant intervals iteratively.
Findings
Improves long video understanding without additional training.
Reduces memory consumption compared to dense sampling.
Achieves better accuracy by focusing on relevant temporal regions.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance in video understanding tasks. However, they continue to struggle with long-form videos because of an inefficient perception of temporal intervals. Unlike humans, who can dynamically adjust their temporal focus to locate query-relevant moments, current MLLMs often rely on dense, uniform sampling across the video timeline, leading to high memory consumption and a risk of missing crucial information. To address this challenge, we introduce Temporal Search (TS), a training-free framework that enables MLLMs to iteratively explore temporal regions for improved long video understanding. TS is based on a key observation: the model's generation confidence across different temporal intervals is highly correlated with prediction accuracy. TS operates through two main iterative stages. First, the MLLM proposes a temporal interval…
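The abstract is truncated here, so the exact procedure is not available on this page. As a rough illustration of the general idea named in the TL;DR (confidence-guided, best-first exploration of temporal intervals), the following is a minimal sketch and not the authors' algorithm. Everything in it is an assumption made for illustration: the function name `best_first_interval_search`, the binary splitting of intervals, the `min_length` and `max_steps` parameters, and the toy `toy_score` function, which stands in for an MLLM's generation confidence on frames sampled from an interval.

```python
import heapq
from typing import Callable, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def best_first_interval_search(
    duration: float,
    score_interval: Callable[[Interval], float],
    max_steps: int = 8,
    min_length: float = 10.0,
) -> Tuple[Interval, float]:
    """Hypothetical best-first search over temporal intervals.

    score_interval is assumed to return a confidence in [0, 1] for an
    interval; in a real system it would query an MLLM, here it is a stub.
    """
    root = (0.0, duration)
    root_conf = score_interval(root)
    best = (root, root_conf)
    # heapq is a min-heap, so confidence is negated to pop the highest first.
    frontier = [(-root_conf, root)]

    for _ in range(max_steps):
        if not frontier:
            break
        _, (start, end) = heapq.heappop(frontier)
        if end - start <= min_length:
            continue  # too short to refine further (assumed stopping rule)
        mid = (start + end) / 2.0
        # Assumed refinement step: split the interval in half and score each half.
        for child in ((start, mid), (mid, end)):
            conf = score_interval(child)
            if conf > best[1]:
                best = (child, conf)
            heapq.heappush(frontier, (-conf, child))
    return best


def toy_score(interval: Interval) -> float:
    """Toy confidence: fraction of the interval inside a hidden relevant window."""
    relevant_start, relevant_end = 130.0, 150.0
    start, end = interval
    overlap = max(0.0, min(end, relevant_end) - max(start, relevant_start))
    return overlap / (end - start)


best_interval, best_conf = best_first_interval_search(240.0, toy_score)
print(best_interval, best_conf)
```

In this toy setup the search spends its scoring budget on the half of the video that looks most promising instead of sampling the whole timeline densely, which is the intuition behind the memory savings listed under Findings.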
