OVD: On-policy Verbal Distillation
Jing Xiong, Hui Shen, Shansan Gong, Yuxin Cheng, Jianghan Shen, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Ngai Wong

TL;DR
OVD introduces a memory-efficient on-policy verbal distillation framework that replaces token-level matching with trajectory matching using verbal scores, enabling better exploration and improved performance in reasoning tasks.
Contribution
The paper proposes a novel on-policy verbal distillation method that reduces memory usage and enhances exploration by replacing token-level alignment with trajectory-level matching using verbal scores.
Findings
Significant performance improvements on Web Q&A and math reasoning tasks.
Reduced memory consumption compared to existing token-level distillation methods.
Faster training with comparable or better results.
Abstract
Knowledge distillation offers a promising path to transfer reasoning capabilities from large teacher models to efficient student models; however, existing token-level on-policy distillation methods require token-level alignment between the student and teacher models, which restricts the student model's exploration ability, prevents effective use of interactive environment feedback, and causes severe memory bottlenecks in reinforcement learning. We introduce On-policy Verbal Distillation (OVD), a memory-efficient framework that replaces token-level probability matching with trajectory matching using discrete verbal scores (0 to 9) from teacher models. OVD dramatically reduces memory consumption while enabling on-policy distillation from teacher models with verbal feedback, and avoids token-level alignment, allowing the student model to freely explore the output space. Extensive…
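To make the idea of trajectory matching with verbal scores concrete, the following is a minimal sketch, not the authors' implementation. It assumes the teacher replies in free text containing a single digit from 0 to 9, that the digit is rescaled to [0, 1] as a per-trajectory reward, and that a mean baseline is subtracted across a batch of student trajectories. The function names, the parsing rule, the rescaling, and the baseline are all hypothetical choices made for illustration; the abstract specifies only that the scores are discrete values from 0 to 9.

```python
import re
from statistics import mean


def parse_verbal_score(teacher_reply: str) -> int:
    """Return the first standalone digit 0-9 found in the teacher's reply.

    Hypothetical parsing rule: multi-digit numbers such as "10" are ignored.
    """
    match = re.search(r"(?<!\d)([0-9])(?!\d)", teacher_reply)
    if match is None:
        raise ValueError(f"no 0-9 score found in reply: {teacher_reply!r}")
    return int(match.group(1))


def trajectory_rewards(teacher_replies: list[str]) -> list[float]:
    """Map each teacher reply to one scalar reward in [0, 1] per trajectory."""
    return [parse_verbal_score(reply) / 9.0 for reply in teacher_replies]


def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the batch mean (an assumed baseline, not taken from the paper)."""
    baseline = mean(rewards)
    return [reward - baseline for reward in rewards]
```

Under these assumptions, each student trajectory contributes a single scalar to the training signal, in contrast to token-level matching, which compares student and teacher outputs at every token position.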
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Topic Modeling · Multimodal Machine Learning Applications
