Visual-Advantage On-Policy Distillation for Vision-Language Models

Ruiqi Liu; Xiaolei Lv; Gengsheng Li; Ximo Zhu; Zhiheng Wang; Zhengbo Zhang; Junkai Chen; Zhiheng Li; Bo Li; Jun Gao; Shu Wu

arXiv:2605.21924 · cs.CV · May 22, 2026

TL;DR

This paper introduces Visual-Advantage On-Policy Distillation (VA-OPD), an on-policy distillation method for vision-language models that emphasizes high-visual-advantage tokens, i.e., the tokens whose teacher log-probability depends most on fine-grained visual detail, and reports consistent performance improvements.

Contribution

The paper proposes VA-OPD, a new distillation approach that leverages token-level visual advantage to improve vision-language models across multiple benchmarks.

Findings

01

VA-OPD outperforms standard distillation on all benchmarks.

02

Performance gains increase with larger teacher models and more data.

03

High-VA tokens are key to effective visual supervision transfer.

Abstract

On-policy knowledge distillation has proven effective for language models, yet its application to vision-language models (VLMs) remains underexplored. We observe that standard on-policy distillation can improve a student's output quality while failing to strengthen its reliance on visual input: on vision-critical tokens, the student's predictions remain largely unchanged whether or not fine-grained visual detail is present, even though the teacher's predictions depend heavily on it. To make this difference observable, we introduce visual advantage (VA), the token-level log-probability difference when the teacher scores a student-generated rollout with versus without access to fine-grained visual detail. VA is concentrated in a small minority of tokens, and these high-VA tokens are the ones that actually carry the visual supervision signal. This motivates a distillation objective that…
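The definition of visual advantage in the abstract can be illustrated with a minimal sketch. It assumes the teacher has already produced per-position logits for the same student rollout under two conditions: full visual input and an input lacking fine-grained visual detail. The abstract does not say how the second input is constructed, so that step is left out. The function names (`visual_advantage`, `top_va_positions`), the `fraction` parameter, and the toy logits are hypothetical and not taken from the paper. The sketch covers only the VA computation and the selection of high-VA positions, not the distillation objective, which the truncated abstract does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def visual_advantage(logits_with, logits_without, token_ids):
    """Per-token VA for a single student-generated rollout.

    logits_with[t]    : teacher logits at position t given full visual input
    logits_without[t] : teacher logits at position t without fine-grained detail
    token_ids[t]      : token the student actually generated at position t
    Returns log p_with(token) - log p_without(token) for each position.
    """
    va = []
    for lw, lwo, tok in zip(logits_with, logits_without, token_ids):
        va.append(log_softmax(lw)[tok] - log_softmax(lwo)[tok])
    return va


def top_va_positions(va, fraction=0.25):
    """Indices of the highest-VA positions (hypothetical selection rule)."""
    k = max(1, math.ceil(fraction * len(va)))
    ranked = sorted(range(len(va)), key=lambda i: va[i], reverse=True)
    return sorted(ranked[:k])


# Toy rollout: 4 positions, vocabulary of 3 tokens (made-up numbers).
# Only position 2 changes when fine-grained visual detail is removed.
logits_with = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 4.0], [1.0, 1.0, 1.0]]
logits_without = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
token_ids = [0, 1, 2, 0]

va = visual_advantage(logits_with, logits_without, token_ids)
high_va = top_va_positions(va, fraction=0.25)
```

In this toy rollout, VA is zero wherever the teacher's distribution is unaffected by the missing detail and positive only at position 2, mirroring the abstract's observation that VA is concentrated in a small minority of tokens.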
