Embarrassingly Simple Self-Distillation Improves Code Generation
Ruixiang Zhang, Richard He Bai, Huangjie Zheng, Navdeep Jaitly, Ronan Collobert, Yizhe Zhang

TL;DR
Simple self-distillation (SSD), which samples solutions from a model and fine-tunes that same model on the raw samples, substantially improves LLM code generation without a verifier, a teacher model, or reinforcement learning.
Contribution
The paper introduces SSD, a sample-then-fine-tune method that improves code generation across Qwen and Llama models at 4B, 8B, and 30B scale, and traces the gains to a precision-exploration conflict in LLM decoding.
Findings
SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6.
Gains are concentrated on harder problems and generalize across models and scales.
SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while still leaving room for exploration.
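For readers unfamiliar with the metric, the sketch below shows how pass@k is commonly computed from sampled solutions. It uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), which for k = 1 reduces to the fraction of correct samples. This listing does not describe the paper's exact evaluation protocol, so the function names and the per-problem `(n, c)` input format here are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k: int) -> float:
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical per-problem tallies, not data from the paper.
    tallies = [(10, 3), (10, 7), (10, 0)]
    print(f"pass@1 = {mean_pass_at_k(tallies, 1):.3f}")
```

With this definition, the reported change from 42.4% to 55.3% is a gain of 12.9 percentage points in the average per-problem success rate of a single sample.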
Abstract
Can a large language model (LLM) improve at code generation using only its own raw outputs, without a verifier, a teacher model, or reinforcement learning? We answer in the affirmative with simple self-distillation (SSD): sample solutions from the model with certain temperature and truncation configurations, then fine-tune on those samples with standard supervised fine-tuning. SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6, with gains concentrating on harder problems, and it generalizes across Qwen and Llama models at 4B, 8B, and 30B scale, including both instruct and thinking variants. To understand why such a simple method can work, we trace these gains to a precision-exploration conflict in LLM decoding and show that SSD reshapes token distributions in a context-dependent way, suppressing distractor tails where precision matters while preserving useful…
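The sampling step described in the abstract can be illustrated on a single next-token distribution. The sketch below applies temperature scaling followed by truncation, then samples from what remains. The abstract does not say which truncation scheme or which settings the authors use; top-k and top-p (nucleus) truncation are assumed here because they are common choices, and all function names, parameters, and values are hypothetical.

```python
import math
import random


def truncated_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Probabilities after temperature scaling and optional top-k / top-p truncation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    if top_p is not None and not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:  # keep the smallest prefix whose mass reaches top_p
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept

    # Renormalise over the surviving tokens; the truncated tail gets zero mass.
    kept_mass = sum(probs[i] for i in order)
    out = [0.0] * len(probs)
    for i in order:
        out[i] = probs[i] / kept_mass
    return out


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the truncated distribution."""
    probs = truncated_distribution(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical 4-token vocabulary
    print(truncated_distribution(toy_logits, temperature=1.5, top_k=3))
    print(sample_token(toy_logits, rng, temperature=1.5, top_p=0.9))
```

In this toy setting, a higher temperature flattens the distribution and encourages exploration, while truncation removes low-probability tail tokens. In SSD, full solutions sampled this way would then become the training targets for standard supervised fine-tuning of the same model.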
Code & Models
- 🤗 apple/SimpleSD-4B-instruct (model) · 1.1k dl · ♡ 4
- 🤗 apple/SimpleSD-4B-thinking (model) · 213 dl · ♡ 3
- 🤗 apple/SimpleSD-30B-instruct (model) · 575 dl · ♡ 6
- 🤗 ml-intern-explorers/ssd-qwen3vl-oxfordpets (model) · ♡ 1
- 🤗 ludsvick/gemma-4-E2B-it-SSD (model)
- 🤗 moos124/ssd-distilled-qwen2.5-1.5b (model)
- 🤗 shaneMattner/Qwen3.6-35B-A3B-RFT (model) · 19 dl
