Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Byung-Kwan Lee; Yu-Chiang Frank Wang; Ryo Hachiuma

arXiv:2512.22238 · cs.LG · December 30, 2025

TL;DR

This paper introduces Masters, a framework for distilling vision-language models that masks the teacher's non-dominant weights and progressively restores them during training, combined with offline reinforcement learning to stabilize student learning and improve performance.

Contribution

The paper proposes a mask-progressive RL distillation method that improves knowledge transfer from large teacher vision-language models to small students despite the large size gap between them.

Findings

01 Improved student model performance on vision-language tasks.

02 Stable and efficient knowledge distillation process.

03 Effective offline RL rewards for guiding student learning.

Abstract

Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training.…
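
The mask-then-restore idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "non-dominant" means smallest in magnitude and that the teacher is restored on a linear schedule; both are illustrative guesses, and the names `mask_non_dominant`, `keep_ratio_schedule`, and `start_ratio` are hypothetical.

```python
def mask_non_dominant(weights, keep_ratio):
    """Zero every weight except the largest-magnitude fraction `keep_ratio`.

    Assumption: "non-dominant" is read here as "small in magnitude";
    the paper may use a different criterion.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must lie in [0, 1]")
    n_keep = int(keep_ratio * len(weights))
    # Indices ordered from largest to smallest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:n_keep])
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]


def keep_ratio_schedule(step, total_steps, start_ratio=0.5):
    """Linearly raise the kept fraction from `start_ratio` to 1.0.

    Assumption: a linear ramp is used purely for illustration.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start_ratio + (1.0 - start_ratio) * progress


if __name__ == "__main__":
    # Toy "teacher" weights; a real teacher would be a set of tensors.
    teacher = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
    total_steps = 10
    for step in (0, 5, 10):
        ratio = keep_ratio_schedule(step, total_steps)
        masked = mask_non_dominant(teacher, ratio)
        print(f"step={step} keep_ratio={ratio:.2f} masked_teacher={masked}")
```

In this toy setup the student would be trained against the masked teacher at each step, so early targets come from a lower-capacity teacher and later targets from the fully restored one. The reinforcement learning component is not sketched here because the abstract excerpt does not describe it.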

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling