VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting

Daeun Lee; Shoubin Yu; Yue Zhang; Mohit Bansal

arXiv:2603.14659 · cs.CV · March 17, 2026

Open Access

TL;DR

VisionCoach introduces a reinforcement learning framework that uses visual prompts during training to enhance spatio-temporal grounding in video reasoning, enabling accurate reasoning without additional tools at inference.

Contribution

The paper proposes a novel visual-prompting RL approach with self-distillation for grounded video reasoning, reducing reliance on external tools and improving performance.

Findings

01. Achieves state-of-the-art results on multiple video reasoning benchmarks.

02. Improves spatio-temporal grounding accuracy through visual prompting during training.

03. Enables inference without external perception tools via self-distillation.

Abstract

Video reasoning requires models to locate and track question-relevant evidence across frames. While reinforcement learning (RL) with verifiable rewards improves accuracy, it still struggles to achieve reliable spatio-temporal grounding during the reasoning process. Moreover, improving grounding typically relies on scaled training data or inference-time perception tools, which increase annotation or computational cost. To address this challenge, we propose VisionCoach, an input-adaptive RL framework that improves spatio-temporal grounding through visual prompting as training-time guidance. During RL training, visual prompts are selectively applied to challenging inputs to amplify question-relevant evidence and suppress distractors. The model then internalizes these improvements through self-distillation, enabling grounded reasoning directly on raw videos without visual prompting at…
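The input-adaptive step described in the abstract (visual prompts applied only to challenging inputs during RL training) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's implementation: the difficulty test (mean rollout reward below a threshold), the names `is_challenging`, `select_training_input`, and `mark_frames`, the 0.5 threshold, and the toy representation of a video as a list of frame strings. The self-distillation stage is not shown.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Video = List[str]  # toy stand-in: one string per frame


def is_challenging(rollout_rewards: Sequence[float], threshold: float = 0.5) -> bool:
    # Assumed difficulty test (not from the paper): an input counts as
    # challenging when its mean reward over rollouts on the raw video is low.
    return mean(rollout_rewards) < threshold


def select_training_input(
    raw_video: Video,
    rollout_rewards: Sequence[float],
    apply_visual_prompt: Callable[[Video], Video],
    threshold: float = 0.5,
) -> Tuple[Video, bool]:
    # Return the input to train on and whether a visual prompt was applied.
    if is_challenging(rollout_rewards, threshold):
        return apply_visual_prompt(raw_video), True
    return raw_video, False


def mark_frames(video: Video) -> Video:
    # Hypothetical visual prompt: tag each frame. A real prompt would
    # highlight question-relevant regions or frames in the pixels.
    return [f"[prompted] {frame}" for frame in video]


# A low-reward input gets the prompt; a high-reward input stays raw.
hard_input, hard_flag = select_training_input(["f0", "f1"], [0.0, 0.25], mark_frames)
easy_input, easy_flag = select_training_input(["f0", "f1"], [1.0, 0.75], mark_frames)
```

Gating on a per-input signal keeps the prompting cost confined to training examples where the raw video alone is not yet enough, which is one plausible reading of "input-adaptive" here.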

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling