Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation

Jiaze Li; Hao Yin; Haoran Xu; Boshen Xu; Wenhui Tan; Zewen He; Jianzhong Ju; Zhenbo Luo; Jian Luan

arXiv:2602.02994 · cs.CV · May 15, 2026

TL;DR

Video-OPD introduces an efficient on-policy distillation framework for temporal video grounding, improving training speed and reducing computational costs compared to traditional reinforcement learning methods.

Contribution

The paper presents Video-OPD, a novel on-policy distillation approach with a curriculum, enhancing training efficiency and performance in temporal video grounding tasks.

Findings

01 Video-OPD outperforms GRPO in both accuracy and convergence speed.

02 The framework significantly reduces computational overhead relative to GRPO-based reinforcement learning.

03 Teacher-Validated Disagreement Focusing improves training efficiency.

Abstract

Reinforcement learning has emerged as a principled post-training paradigm for Temporal Video Grounding (TVG) due to its on-policy optimization, yet existing GRPO-based methods remain fundamentally constrained by sparse reward signals and substantial computational overhead. We propose Video-OPD, an efficient post-training framework for TVG inspired by recent advances in on-policy distillation. Video-OPD optimizes trajectories sampled directly from the current policy, thereby preserving alignment between training and inference distributions, while a frontier teacher supplies dense, token-level supervision via a reverse KL divergence objective. This formulation preserves the on-policy property critical for mitigating distributional shift, while converting sparse, episode-level feedback into fine-grained, step-wise learning signals. Building on Video-OPD, we introduce Teacher-Validated…
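
The token-level reverse KL objective described in the abstract can be illustrated with a minimal sketch. It assumes the student has already sampled a response and that both student and teacher logits are available at every position of that response. The function names (`log_softmax`, `reverse_kl_per_token`, `opd_loss`) are hypothetical and are not taken from the paper. The full-vocabulary form of the divergence is an assumption as well: the paper's actual loss may use a sampled-token estimate, masking, or other weighting.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl_per_token(student_logits, teacher_logits):
    """Reverse KL, KL(student || teacher), at each position of a sequence.

    Both arguments are lists of per-position logit lists, computed on a
    trajectory sampled from the student (the on-policy part). This is an
    illustrative assumption, not the paper's exact formulation.
    """
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_logp = log_softmax(t_row)
        # sum_v p_student(v) * (log p_student(v) - log p_teacher(v))
        kl = sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s_logp, t_logp))
        losses.append(kl)
    return losses


def opd_loss(student_logits, teacher_logits):
    """Mean per-token reverse KL: one learning signal per generated token."""
    per_token = reverse_kl_per_token(student_logits, teacher_logits)
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy example: two positions, vocabulary of size 2.
    student = [[2.0, 0.0], [0.0, 0.0]]
    teacher = [[0.0, 2.0], [0.0, 0.0]]
    print(reverse_kl_per_token(student, teacher))
    print(opd_loss(student, teacher))
```

Every position contributes its own term, so the student receives feedback at each token rather than a single scalar reward for the whole response. This is the contrast the abstract draws with sparse, episode-level rewards.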