VISD: Enhancing Video Reasoning via Structured Self-Distillation

Hao Lin; Kunyang Lv; Xu Jiang; Jingqi Tian; Zhongjing Du; Jiayu Ding; Qiaoman Zhang; Hongbo Jin

arXiv:2605.06094 · cs.CV · May 12, 2026

TL;DR

VISD introduces a structured self-distillation framework with a video-aware judge to improve reasoning accuracy and training efficiency for VideoLLMs by providing meaningful, multi-dimensional feedback.

Contribution

The paper proposes VISD, a structured self-distillation method that enhances video reasoning by decomposing reasoning quality into multiple dimensions (answer correctness, logical consistency, and spatio-temporal grounding) and by stabilizing training with RL.

Findings

01. VISD improves answer accuracy and spatio-temporal grounding quality.

02. VISD achieves nearly 2x faster convergence in training.

03. Structured feedback enhances reasoning faithfulness and training efficiency.

Abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this…
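To make the contrast between a single sequence-level reward and multi-dimensional judge feedback concrete, here is a minimal sketch. It is not the paper's implementation: the `JudgeFeedback` class, its field names, the 0-to-1 score scale, and the `weakest_dimension` helper are all hypothetical, chosen only to mirror the three dimensions named in the abstract.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JudgeFeedback:
    """Hypothetical container for a judge's scores on one reasoning trajectory.

    The three fields mirror the dimensions named in the abstract; the
    0.0-to-1.0 scale is an assumption made for this illustration.
    """

    answer_correctness: float
    logical_consistency: float
    spatiotemporal_grounding: float

    def weakest_dimension(self) -> str:
        """Return the name of the lowest-scoring dimension."""
        scores = asdict(self)
        return min(scores, key=scores.get)


# Two hypothetical trajectories that both reach the correct answer.
poorly_grounded = JudgeFeedback(
    answer_correctness=1.0, logical_consistency=0.9, spatiotemporal_grounding=0.2
)
well_grounded = JudgeFeedback(
    answer_correctness=1.0, logical_consistency=0.9, spatiotemporal_grounding=0.9
)

# A reward based only on the final answer cannot tell the two apart.
same_scalar_reward = (
    poorly_grounded.answer_correctness == well_grounded.answer_correctness
)

# The structured feedback points at what went wrong in the first one.
diagnosis = poorly_grounded.weakest_dimension()

print(same_scalar_reward, diagnosis)
```

In this toy setup, both trajectories would receive the same answer-only reward, while the per-dimension scores single out grounding as the weak point of the first one. How VISD actually turns such feedback into a training signal is not specified in the text available here.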
