TL;DR
This paper introduces a context repair method for video reasoning that leverages larger models as tools to identify and supply missing evidence, improving accuracy and generalization in multi-modal video understanding tasks.
Contribution
It proposes an observation-level intervention in which a frozen, tool-integrated teacher model supplies missing evidence to a smaller model, together with a new reward for training. The paper reports that this outperforms existing methods on video reasoning benchmarks.
Findings
Consistent accuracy improvements across multiple benchmarks.
Enhanced generalization capabilities in video reasoning tasks.
Effective use of larger models as tools for context repair.
Abstract
Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines rely either on on-policy self-exploration, which plateaus at the model's knowledge boundary, or on hybrid replay, which mixes policies and demands careful regularization. Dynamic context methods zoom into focused evidence but often require curated pretraining and two-stage tuning, and their context remains bounded by a small model's capability. In contrast, larger models excel at instruction following and multi-modal understanding, can supply richer context to smaller models, and can rapidly zoom in on target regions via simple tools. Building on this capability, we introduce an observation-level intervention: a frozen, tool-integrated teacher identifies the missing spatiotemporal dependency and provides a minimal evidence patch (e.g., timestamps, regions) from the original video while…
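The evidence patch described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's implementation: the `EvidencePatch` class, the `repair_context` function, and the `budget` parameter are hypothetical names introduced here, and the sketch assumes the student sees a list of sampled frame indices to which frames from the teacher's timestamp window are added.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidencePatch:
    """Hypothetical container for teacher-supplied evidence (not from the paper)."""
    start_s: float  # start of the relevant time window, in seconds
    end_s: float    # end of the relevant time window, in seconds
    region: Optional[Tuple[int, int, int, int]] = None  # (x0, y0, x1, y1) in pixels


def repair_context(
    sampled: List[int],
    patch: EvidencePatch,
    fps: float,
    num_frames: int,
    budget: int = 4,
) -> List[int]:
    """Add up to `budget` evenly spaced frames from the patch window
    to the student's sampled frame indices (assumed repair strategy)."""
    first = max(0, int(patch.start_s * fps))
    last = min(num_frames - 1, int(patch.end_s * fps))
    if last < first:
        # The patch window lies outside the video; leave the context unchanged.
        return sorted(set(sampled))
    span = last - first
    n = min(budget, span + 1)
    if n == 1:
        extra = [first]
    else:
        extra = [first + round(i * span / (n - 1)) for i in range(n)]
    return sorted(set(sampled) | set(extra))


# Usage: the student sampled four sparse frames; the teacher points at 5.0-6.0 s.
repaired = repair_context(
    [0, 100, 200, 300], EvidencePatch(5.0, 6.0), fps=30.0, num_frames=400
)
```

The small `budget` reflects the abstract's emphasis on a minimal patch: the student's context is extended only around the window the teacher flags, not resampled wholesale.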
