Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
Zhixuan Wu, Quanxing Zha, Teng Wang, Genbao Xu, Wenyuan Gu, Wei Rao, Nan Ma, Bo Cheng, Soujanya Poria

TL;DR
Chain-of-Glimpse introduces a search-guided, multi-step reasoning framework for video understanding that explicitly grounds reasoning in visual evidence, improving interpretability and robustness.
Contribution
It proposes a step-by-step video reasoning method, driven by a search-guided controller trained with reinforcement learning, that anchors each reasoning step to specific visual evidence regions.
Findings
Achieves consistent performance improvements on multiple video reasoning benchmarks.
Demonstrates robustness and generalization across diverse datasets.
Provides interpretable multi-step reasoning trajectories.
Abstract
Video understanding requires identifying and reasoning over semantically discriminative visual objects across frames, yet existing object-agnostic solutions struggle to effectively handle substantial object variations over time. To address this, we introduce Chain-of-Glimpse, a search-guided progressive object-grounded reasoning framework that explicitly anchors each reasoning step to specific visual evidence regions, enabling compositional and multi-step decision-making. Formally, Chain-of-Glimpse formulates video reasoning as a step-by-step process that incrementally builds spatially grounded traces around task-relevant visual objects, thereby mitigating over-reliance on saliency-driven cues. Specifically, Chain-of-Glimpse features a search-guided controller, optimized via reinforcement learning with a format reward that significantly incentivizes grounding capability, to iteratively…
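The abstract mentions a format reward that incentivizes grounding, but the text available here is truncated and does not define it. The following is a minimal sketch of what such a reward could look like, not the authors' implementation. The `<glimpse>` and `<answer>` tags, the normalized box convention, and the names `Glimpse`, `parse_trace`, and `format_reward` are all hypothetical, introduced only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical step format, invented for this sketch (not from the paper):
#   <glimpse frame="12" box="0.10,0.20,0.45,0.80">the person lifts a cup</glimpse>
GLIMPSE_RE = re.compile(
    r'<glimpse frame="(\d+)" box="([^"]*)">(.*?)</glimpse>', re.DOTALL
)


@dataclass
class Glimpse:
    frame: int    # index of the frame the step looks at
    box: tuple    # (x1, y1, x2, y2), assumed normalized to [0, 1]
    thought: str  # the reasoning text attached to this region


def parse_trace(response: str) -> list:
    """Extract well-formed grounded steps; malformed steps are skipped."""
    steps = []
    for frame, box, thought in GLIMPSE_RE.findall(response):
        try:
            coords = tuple(float(v) for v in box.split(","))
        except ValueError:
            continue  # non-numeric coordinate
        if len(coords) != 4:
            continue  # a box needs exactly four numbers
        x1, y1, x2, y2 = coords
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            continue  # degenerate or out-of-range box
        steps.append(Glimpse(int(frame), coords, thought.strip()))
    return steps


def format_reward(response: str) -> float:
    """1.0 if the response has at least one valid grounded step and a final answer, else 0.0."""
    has_answer = re.search(r"<answer>.+?</answer>", response, re.DOTALL) is not None
    return 1.0 if parse_trace(response) and has_answer else 0.0
```

Under these assumptions, a response earns the reward only when every counted step names a frame and a valid region, which is one simple way to push a policy toward grounded rather than free-form reasoning.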
