VISD: Enhancing Video Reasoning via Structured Self-Distillation
Hao Lin, Kunyang Lv, Xu Jiang, Jingqi Tian, Zhongjing Du, Jiayu Ding, Qiaoman Zhang, Hongbo Jin

TL;DR
VISD introduces a structured self-distillation framework with a video-aware judge to improve reasoning accuracy and training efficiency for VideoLLMs by providing meaningful, multi-dimensional feedback.
Contribution
The paper proposes VISD, a structured self-distillation method that improves video reasoning by decomposing reasoning quality into multiple dimensions and stabilizing training when combined with reinforcement learning (RL).
Findings
VISD improves answer accuracy and spatio-temporal grounding quality.
VISD achieves nearly 2x faster convergence in training.
Structured feedback enhances reasoning faithfulness and training efficiency.
Abstract
Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this…
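To make the idea of multi-dimensional judge feedback concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name `JudgeFeedback`, the [0, 1] score range, and the unweighted mean in `scalar_reward` are all assumptions chosen for illustration. The sketch only shows why a structured record can be more diagnostic than a single scalar: the scalar hides which dimension failed, while the structured record keeps it.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JudgeFeedback:
    """Hypothetical container for per-dimension judge scores, assumed to lie in [0, 1]."""
    answer_correctness: float
    logical_consistency: float
    spatio_temporal_grounding: float


def weakest_dimension(feedback: JudgeFeedback) -> str:
    """Return the name of the lowest-scoring dimension (ties go to the first declared field)."""
    scores = asdict(feedback)
    return min(scores, key=scores.get)


def scalar_reward(feedback: JudgeFeedback) -> float:
    """Collapse the feedback into one number using an unweighted mean (an assumption)."""
    scores = asdict(feedback)
    return sum(scores.values()) / len(scores)


# Example: a correct answer reached through poorly grounded reasoning.
fb = JudgeFeedback(
    answer_correctness=1.0,
    logical_consistency=0.5,
    spatio_temporal_grounding=0.25,
)
print(weakest_dimension(fb))  # names the dimension that most needs improvement
print(scalar_reward(fb))      # single number that no longer says which dimension failed
```

In this example the answer is right, yet the structured record still exposes weak grounding, which a correctness-only reward would not reveal.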
