VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
Daeun Lee, Shoubin Yu, Yue Zhang, Mohit Bansal

TL;DR
VisionCoach introduces a reinforcement learning framework that uses visual prompts during training to enhance spatio-temporal grounding in video reasoning, enabling accurate reasoning without additional tools at inference.
Contribution
The paper proposes a novel visual-prompting RL approach with self-distillation for grounded video reasoning, reducing reliance on external tools and improving performance.
Findings
Achieves state-of-the-art results on multiple video reasoning benchmarks.
Improves spatio-temporal grounding accuracy through visual prompting during training.
Enables inference without external perception tools via self-distillation.
Abstract
Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increases annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at…
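The selective step described in the abstract (prompting only the challenging inputs during RL training) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how an input is judged challenging, so the mean-reward threshold below is an assumption, and `select_challenging`, `build_training_inputs`, and `apply_prompt` are hypothetical names introduced only for illustration.

```python
from statistics import mean
from typing import Callable, List, Sequence


def select_challenging(
    rewards_per_input: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> List[int]:
    """Return indices of inputs whose mean rollout reward is below `threshold`.

    Assumption: low average reward across rollouts marks an input as
    "challenging". The source does not specify the actual criterion.
    """
    return [i for i, r in enumerate(rewards_per_input) if mean(r) < threshold]


def build_training_inputs(
    videos: Sequence[str],
    rewards_per_input: Sequence[Sequence[float]],
    apply_prompt: Callable[[str], str],
    threshold: float = 0.5,
) -> List[str]:
    """Apply a visual prompt only to the challenging inputs; leave the rest raw.

    `apply_prompt` is a hypothetical stand-in for whatever visual-prompting
    operation is used during training.
    """
    hard = set(select_challenging(rewards_per_input, threshold))
    return [apply_prompt(v) if i in hard else v for i, v in enumerate(videos)]


# Toy usage: strings stand in for videos, and the "prompt" just tags them.
rewards = [[1.0, 1.0], [0.0, 0.5], [0.5, 0.5]]
videos = ["clip_a", "clip_b", "clip_c"]
prompted = build_training_inputs(videos, rewards, lambda v: v + "+prompt")
```

In this toy setup only the second clip falls below the threshold, so it alone receives the prompt. The self-distillation stage, in which the model learns to reproduce prompted behavior on raw videos, is not covered by this sketch.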
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
