TL;DR
Lightning OPD is an offline on-policy distillation method for large language models that enforces teacher consistency, achieving performance comparable to standard OPD with 4.0x higher training efficiency.
Contribution
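It proposes Lightning OPD, an offline distillation framework that precomputes teacher log-probabilities once over SFT rollouts and enforces teacher consistency (the same teacher for both supervised fine-tuning and OPD), eliminating the need for a live teacher server.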
Findings
Lightning OPD achieves 4.0x higher training efficiency than standard OPD.
It reaches 69.9% on AIME 2024 in just 30 GPU hours, starting from an SFT-initialized model.
The method scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024.
Abstract
On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We…
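The offline idea described in the abstract (score a fixed set of rollouts with the teacher once, then reuse those log-probabilities at every training step) can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the toy bigram "models", the helper names (`make_toy_model`, `precompute_teacher_cache`, `offline_logprob_gap`), and the use of a per-token log-probability gap as the training signal. The summary does not specify the paper's exact loss, so this should not be read as the authors' implementation.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_logprobs(logits, tokens):
    # Log-probability assigned to each token actually present in the rollout.
    # logits: (T, V), tokens: (T,)
    logp = log_softmax(logits)
    return logp[np.arange(len(tokens)), tokens]

def make_toy_model(weights):
    # Hypothetical stand-in for an LLM: next-token logits depend only on
    # the previous token (token 0 acts as the start symbol).
    def logits_fn(seq):
        prev = np.concatenate(([0], seq[:-1]))
        return weights[prev]
    return logits_fn

def precompute_teacher_cache(teacher_logits_fn, rollouts):
    # Run the teacher ONCE over the fixed rollouts and keep only the
    # per-token log-probabilities; no teacher is needed afterwards.
    return [token_logprobs(teacher_logits_fn(seq), seq) for seq in rollouts]

def offline_logprob_gap(student_logits_fn, rollouts, teacher_cache):
    # Mean per-token value of log p_student - log p_teacher on the cached
    # rollouts (an assumed, simplified distillation signal).
    gaps = []
    for seq, teacher_lp in zip(rollouts, teacher_cache):
        student_lp = token_logprobs(student_logits_fn(seq), seq)
        gaps.append(student_lp - teacher_lp)
    return float(np.concatenate(gaps).mean())

# Toy setup: vocabulary of 8 tokens, 4 fixed rollouts of length 6.
rng = np.random.default_rng(0)
vocab_size, seq_len = 8, 6
teacher = make_toy_model(rng.normal(size=(vocab_size, vocab_size)))
student = make_toy_model(rng.normal(size=(vocab_size, vocab_size)))
rollouts = [rng.integers(0, vocab_size, size=seq_len) for _ in range(4)]

cache = precompute_teacher_cache(teacher, rollouts)   # done once, offline
gap = offline_logprob_gap(student, rollouts, cache)   # reused every step
```

In a real pipeline the rollouts would be sampled from the SFT model and the cached values written to disk, so the training job only loads arrays instead of querying a teacher server.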
