Reasoning Compression with Mixed-Policy Distillation
Han Yang, Mingyan Wu, Bailan He, Zeyu Cao, Sikuan Yan, Kevin Qinghong Lin, Zifeng Ding

TL;DR
This paper introduces Mixed-Policy Distillation (MPD), a novel framework that transfers concise reasoning behaviors from large models to smaller ones, reducing token usage and improving reasoning performance.
Contribution
The paper proposes MPD, a distillation method that combines on-policy and off-policy approaches so that a smaller student model learns the concise reasoning behavior of a larger teacher.
Findings
MPD reduces token usage by up to 27.1%.
MPD improves reasoning benchmark performance.
MPD effectively transfers reasoning compression from large to small models.
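The excerpt here does not say how the token-usage figure is measured. As a minimal sketch, assuming it is the relative reduction in mean generated tokens against a baseline model, the arithmetic would look like the following. The function name and the token counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def token_reduction_pct(baseline_tokens: float, compressed_tokens: float) -> float:
    """Relative reduction in token usage, in percent (assumed definition)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - compressed_tokens) / baseline_tokens


# Hypothetical mean token counts per problem, for illustration only.
baseline_mean_tokens = 1000.0
compressed_mean_tokens = 729.0

reduction = token_reduction_pct(baseline_mean_tokens, compressed_mean_tokens)
print(f"{reduction:.1f}% fewer tokens")
```

Under this assumed definition, a negative value would mean the compressed model used more tokens than the baseline.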
Abstract
Reasoning-centric large language models (LLMs) achieve strong performance by generating intermediate reasoning trajectories, but often incur excessive token usage and high inference-time decoding cost. We observe that, when solving the same problems, larger reasoning models can often produce more concise traces, whereas smaller reasoning models tend to generate longer and more redundant trajectories. This is especially problematic in real-world deployment, where memory, latency, and serving-cost constraints often favor smaller models. Our observations suggest that reasoning compression can be transferred from large models to small ones rather than enforced through explicit length constraints. Based on this insight, we propose Mixed-Policy Distillation (MPD), a reasoning compression framework that transfers concise reasoning behavior from a larger-sized teacher to a smaller student by…
