Self-Distillation for Multi-Token Prediction
Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

TL;DR
This paper introduces MTP-D, a self-distillation method that improves multi-token prediction in large language models, significantly boosting inference speed and MTP head acceptance rates at minimal additional training cost.
Contribution
The paper proposes MTP-D and a looped extension strategy, advancing multi-token prediction techniques for faster, more efficient large language model inference.
Findings
MTP-D increases MTP head acceptance rates by 7.5% while largely preserving main-head performance.
Looped extension achieves a 220.4% inference speedup relative to 1-head MTP.
Extensive experiments validate the method's effectiveness across seven benchmarks.
Abstract
As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and a further significant inference speedup relative to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive…
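The abstract does not spell out how head acceptance rate is measured. As a rough illustration only, the sketch below assumes the common greedy-verification convention from speculative decoding: the tokens drafted by the MTP heads are compared, in order, with the tokens the main head would have produced, and a draft token counts as accepted only if every earlier draft token in the same step was also accepted. All function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the main head's tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch rejects this token and all later ones
        n += 1
    return n


def head_acceptance_rates(
    drafts: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]
) -> List[float]:
    """Per-head acceptance rate over a set of decoding steps.

    drafts[i][k] is the token proposed by MTP head k at step i (assumed layout);
    targets[i][k] is the token the main head would emit at that position.
    """
    num_heads = len(drafts[0])
    accepted = [0] * num_heads
    for draft, target in zip(drafts, targets):
        k = accepted_prefix_length(draft, target)
        for h in range(k):
            accepted[h] += 1
    return [a / len(drafts) for a in accepted]


def mean_tokens_per_step(
    drafts: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]
) -> float:
    """Average tokens emitted per step: one main-head token plus accepted drafts."""
    total = sum(1 + accepted_prefix_length(d, t) for d, t in zip(drafts, targets))
    return total / len(drafts)


# Toy data: 4 decoding steps, 3 hypothetical MTP heads.
drafts = [[5, 6, 7], [5, 9, 7], [1, 2, 3], [4, 4, 4]]
targets = [[5, 6, 7], [5, 6, 7], [0, 2, 3], [4, 4, 9]]
rates = head_acceptance_rates(drafts, targets)
tokens_per_step = mean_tokens_per_step(drafts, targets)
```

Under this convention, later heads can never have a higher acceptance rate than earlier ones, which is one reason extending the number of heads gives diminishing returns unless each head's accuracy improves. The mean tokens per step is only a proxy for wall-clock speedup, since it ignores the cost of running the extra heads and of verification.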
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning in Healthcare
