On multi-token prediction for efficient LLM inference
Somesh Mehra, Javier Alonso Garcia, Lukas Mauch

TL;DR
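This paper studies multi-token prediction (MTP) in LLMs pre-trained for next-token prediction. It shows that such models already support MTP through numerical marginalization, that attaching MTP heads to a frozen backbone is difficult, and that joint training helps only partially, with implications for faster inference through parallel token prediction.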
Contribution
It systematically analyzes MTP capabilities in pre-trained LLMs, highlighting inherent abilities, integration challenges, and the impact of joint training on performance.
Findings
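LLMs pre-trained for next-token prediction (NTP) inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities; performance is data-dependent and improves with model scale.
Integrating MTP heads into frozen LLMs is challenging because the hidden layers are strongly specialized for NTP.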
Joint training improves MTP performance but does not fully overcome adaptation barriers.
Abstract
We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
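To make "numerical marginalization over intermediate token probabilities" concrete, here is a minimal sketch of how a next-token model can yield a distribution over the token two positions ahead: P(x_{t+2} | context) is obtained by summing P(x_{t+1} | context) * P(x_{t+2} | context, x_{t+1}) over every candidate x_{t+1}. The toy bigram model, its probabilities, and the names `toy_ntp` and `second_token_dist` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

Dist = Dict[str, float]
NTPModel = Callable[[Sequence[str]], Dist]


def toy_ntp(context: Sequence[str]) -> Dist:
    """Hypothetical NTP model: a bigram table that looks only at the last token."""
    table = {
        "a": {"a": 0.1, "b": 0.9},
        "b": {"a": 0.6, "b": 0.4},
    }
    return table[context[-1]]


def second_token_dist(ntp: NTPModel, context: Sequence[str]) -> Dist:
    """Distribution over the token two steps ahead, marginalizing the one in between."""
    first = ntp(context)  # P(x_{t+1} | context)
    out: Dist = {}
    for mid, p_mid in first.items():
        extended: List[str] = list(context) + [mid]
        nxt = ntp(extended)  # P(x_{t+2} | context, x_{t+1} = mid)
        for tok, p_tok in nxt.items():
            out[tok] = out.get(tok, 0.0) + p_mid * p_tok
    return out


dist = second_token_dist(toy_ntp, ["a"])
```

Exact marginalization needs one model call per intermediate token, so its cost grows with vocabulary size. Restricting the sum to the most probable intermediate tokens is one possible approximation, mentioned here as an assumption rather than as a detail from the abstract.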
Taxonomy
Topics: Natural Language Processing Techniques
