Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen, Jia Jia, Wenwu Zhu

TL;DR
This paper introduces Video-TwG, a curriculum reinforced framework for long video understanding that employs active on-demand grounding to improve reasoning accuracy and reduce hallucinations in video question answering tasks.
Contribution
The paper proposes a novel Think-with-Grounding paradigm with a two-stage reinforced curriculum strategy and a new TwG-GRPO algorithm, enabling more effective and scalable long video reasoning.
Findings
Outperforms strong long video understanding (LVU) baselines on multiple datasets.
Two-stage curriculum improves grounding and reasoning.
TwG-GRPO leverages unlabeled data effectively.
Abstract
Long video understanding is challenging due to rich and complicated multimodal clues in long temporal range. Current methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form reasoning. However, the existing literature suffers from the fact that the text-only reasoning under fixed video context may exacerbate hallucinations since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long videos. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when necessary. Video-TwG can be trained end-to-end in a straightforward manner, without relying…
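The on-demand grounding loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Step`, `Context`, `think_with_grounding`, and `toy_policy` are hypothetical, timestamps stand in for decoded frames, and the policy is a hard-coded stub rather than a video LLM. It only shows the control flow of interleaved reasoning, in which the model either answers or requests a closer look at a clip.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    kind: str                                   # "ground" or "answer"
    text: str                                   # reasoning text or final answer
    span: Optional[Tuple[float, float]] = None  # (start_s, end_s) when kind == "ground"


@dataclass
class Context:
    question: str
    frames: List[float]                         # timestamps standing in for sampled frames
    trace: List[str] = field(default_factory=list)


def sample_timestamps(start: float, end: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps covering [start, end]."""
    if n == 1:
        return [(start + end) / 2]
    step = (end - start) / (n - 1)
    return [start + i * step for i in range(n)]


def think_with_grounding(
    question: str,
    duration_s: float,
    policy: Callable[[Context], Step],
    coarse_frames: int = 8,
    clip_frames: int = 4,
    max_turns: int = 4,
) -> Tuple[Optional[str], Context]:
    """Start from a coarse view of the video; zoom into clips only on request."""
    ctx = Context(question, sample_timestamps(0.0, duration_s, coarse_frames))
    for _ in range(max_turns):
        step = policy(ctx)
        ctx.trace.append(step.text)
        if step.kind == "answer":
            return step.text, ctx
        # Grounding request: clamp the span to the video and add denser samples.
        start, end = step.span
        start, end = max(0.0, start), min(duration_s, end)
        ctx.frames.extend(sample_timestamps(start, end, clip_frames))
    return None, ctx  # turn budget exhausted without an answer


def toy_policy(ctx: Context) -> Step:
    """Hypothetical stand-in for the model: ground once, then answer."""
    if len(ctx.frames) == 8:
        return Step("ground", "Need a closer look at 120-150s.", (120.0, 150.0))
    return Step("answer", "B")


answer, ctx = think_with_grounding("What does the chef add last?", 600.0, toy_policy)
```

In this sketch the context grows only when the policy asks for grounding, which mirrors the idea of zooming into question-relevant clips when necessary instead of reasoning over a fixed set of frames.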
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
