Visual-Advantage On-Policy Distillation for Vision-Language Models
Ruiqi Liu, Xiaolei Lv, Gengsheng Li, Ximo Zhu, Zhiheng Wang, Zhengbo Zhang, Junkai Chen, Zhiheng Li, Bo Li, Jun Gao, Shu Wu

TL;DR
This paper introduces Visual-Advantage On-Policy Distillation (VA-OPD), a distillation method for vision-language models that emphasizes tokens with high visual advantage, the small minority of tokens that carry the visual supervision signal, and reports consistent improvements over standard distillation.
Contribution
The paper proposes VA-OPD, a new distillation approach that leverages token-level visual advantage to improve vision-language models across multiple benchmarks.
Findings
VA-OPD outperforms standard distillation on all benchmarks.
Performance gains increase with larger teacher models and more data.
High-VA tokens are key to effective visual supervision transfer.
Abstract
On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that…
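The definition of visual advantage in the abstract can be illustrated with a minimal sketch. It assumes the teacher's per-token log-probabilities for one student rollout are already available as two lists, one scored with full visual input and one scored without fine-grained visual detail. The function names, the `fraction` parameter, and the top-fraction selection rule are hypothetical choices made for illustration; the abstract does not specify how the detail is withheld or how high-VA tokens enter the objective.

```python
from typing import List


def visual_advantage(logp_full: List[float], logp_degraded: List[float]) -> List[float]:
    """Per-token VA: teacher log-prob with full visual detail minus without it.

    Both lists score the same student-generated rollout, one entry per token.
    """
    if len(logp_full) != len(logp_degraded):
        raise ValueError("both scorings must cover the same tokens")
    return [full - degraded for full, degraded in zip(logp_full, logp_degraded)]


def top_va_indices(va: List[float], fraction: float) -> List[int]:
    """Return positions of the highest-VA tokens (hypothetical selection rule)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(va) * fraction))
    ranked = sorted(range(len(va)), key=lambda i: va[i], reverse=True)
    return sorted(ranked[:k])


# Toy values (made up): only the second token depends strongly on the image.
logp_full = [-0.1, -0.2, -0.3, -0.5]
logp_degraded = [-0.1, -2.2, -0.4, -0.5]
va = visual_advantage(logp_full, logp_degraded)
high_va = top_va_indices(va, fraction=0.25)
```

In this toy rollout the teacher's confidence in the second token drops sharply once visual detail is withheld, so that token receives the largest VA and is the one selected.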
