SODA: Semi On-Policy Black-Box Distillation for Large Language Models

Xiwen Chen; Jingjing Wang; Wenhui Zhu; Peijie Qiu; Xuanzhao Dong; Hejian Sang; Zhipeng Wang; Alborz Geramifard; Feng Luo

arXiv:2604.03873 · cs.LG · April 24, 2026

TL;DR

SODA is a semi on-policy black-box distillation method for large language models. It pairs each teacher response with a one-time static snapshot of the student's output, which avoids the costly dynamic rollouts of fully on-policy methods.

Contribution

The paper presents SODA, a novel distillation approach that eliminates adversarial training and reduces computational costs while maintaining or improving performance.

Findings

1. SODA matches or outperforms state-of-the-art methods on 15 out of 16 benchmarks.

2. It trains 10 times faster and uses 27% less GPU memory than previous methods.

3. It completely eliminates adversarial instability in distillation.

Abstract

Black-box knowledge distillation for large language models presents a strict trade-off. Simple off-policy methods (e.g., sequence-level knowledge distillation) struggle to correct the student's inherent errors. Fully on-policy methods (e.g., Generative Adversarial Distillation) solve this via adversarial training but introduce well-known training instability and crippling computational overhead. To address this dilemma, we propose SODA (Semi On-policy Distillation with Alignment), a highly efficient alternative motivated by the inherent capability gap between frontier teachers and much smaller base models. Because a compact student model's natural, zero-shot responses are almost strictly inferior to the powerful teacher's targets, we can construct a highly effective contrastive signal simply by pairing the teacher's optimal response with a one-time static snapshot of the student's…
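The pairing step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `PreferencePair`, `build_static_pairs`, `teacher_generate`, and `student_snapshot_generate` are hypothetical, and because the abstract is truncated before it states the training objective, `pairwise_logistic_loss` is only a generic pairwise preference loss used as a stand-in.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PreferencePair:
    """One contrastive example (hypothetical container, not from the paper)."""
    prompt: str
    chosen: str    # teacher response, treated as the preferred target
    rejected: str  # response from the frozen, one-time student snapshot


def build_static_pairs(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
    student_snapshot_generate: Callable[[str], str],
) -> List[PreferencePair]:
    """Query both models once per prompt, before training starts.

    The student side is never re-sampled during training, which is what
    makes the data "semi" on-policy rather than fully on-policy.
    """
    pairs = []
    for prompt in prompts:
        chosen = teacher_generate(prompt)            # black-box call: text only
        rejected = student_snapshot_generate(prompt)
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs


def pairwise_logistic_loss(
    logp_chosen: float, logp_rejected: float, beta: float = 1.0
) -> float:
    """Generic -log(sigmoid(beta * margin)); an assumed stand-in objective.

    logp_chosen and logp_rejected are the trainable student's sequence
    log-probabilities for the teacher response and the snapshot response.
    """
    margin = beta * (logp_chosen - logp_rejected)
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In use, the pairs would be built once and reused for every epoch, for example `build_static_pairs(prompts, call_teacher_api, base_student.generate)`, where both callables are placeholders for whatever inference interface is available. Only text from the teacher is needed, which is consistent with the black-box setting.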
