On multi-token prediction for efficient LLM inference

Somesh Mehra; Javier Alonso Garcia; Lukas Mauch

arXiv:2502.09419·cs.CL·February 14, 2025

Open Access

TL;DR

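This paper examines multi-token prediction (MTP) in LLMs pre-trained for next-token prediction (NTP). It finds that such models can already predict multiple tokens via numerical marginalization, that adding MTP heads to a frozen backbone is difficult, and that joint training helps only partly. The results inform strategies for faster inference through parallel token prediction.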

Contribution

The paper systematically analyzes the MTP capabilities of pre-trained LLMs: their inherent ability to predict multiple tokens, the difficulty of integrating MTP heads into a frozen model, and the effect of training the heads jointly with the backbone.

Findings

01  LLMs inherently possess MTP capabilities via numerical marginalization.

02  Integrating MTP heads into frozen LLMs is challenging due to layer specialization.

03  Joint training improves MTP performance but does not fully overcome adaptation barriers.

Abstract

We systematically investigate multi-token prediction (MTP) capabilities within LLMs pre-trained for next-token prediction (NTP). We first show that such models inherently possess MTP capabilities via numerical marginalization over intermediate token probabilities, though performance is data-dependent and improves with model scale. Furthermore, we explore the challenges of integrating MTP heads into frozen LLMs and find that their hidden layers are strongly specialized for NTP, making adaptation non-trivial. Finally, we show that while joint training of MTP heads with the backbone improves performance, it cannot fully overcome this barrier, prompting further research in this direction. Our findings provide a deeper understanding of MTP applied to pretrained LLMs, informing strategies for accelerating inference through parallel token prediction.
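The marginalization idea in the abstract can be illustrated with a small sketch. The probability of the token two positions ahead is obtained by summing over every possible intermediate token: p(x_{t+2} | context) = sum over x_{t+1} of p(x_{t+1} | context) * p(x_{t+2} | context, x_{t+1}). The code below is only an illustration under stated assumptions, not the authors' implementation: `toy_ntp` is a hypothetical three-token stand-in for an NTP model, and the `top_k` truncation with renormalization is an assumption added here to show how the sum could be kept cheap, since each candidate intermediate token costs one extra model call.

```python
def toy_ntp(context):
    """Hypothetical stand-in for an NTP model: returns p(next token | context).

    Assumption: a fixed 3-token table that looks only at the last token.
    A real LLM would condition on the full context.
    """
    table = [
        [0.1, 0.6, 0.3],
        [0.5, 0.2, 0.3],
        [0.3, 0.3, 0.4],
    ]
    return table[context[-1]]


def second_token_marginal(ntp, context, top_k=None):
    """Estimate p(x_{t+2} | context) by marginalizing over x_{t+1}."""
    p_next = ntp(context)
    # Candidate intermediate tokens, most probable first.
    candidates = sorted(range(len(p_next)), key=lambda t: p_next[t], reverse=True)
    if top_k is not None:
        # Assumption: truncate the sum to the top_k candidates to save model calls.
        candidates = candidates[:top_k]
    marginal = [0.0] * len(p_next)
    for tok in candidates:
        # One extra model call per candidate intermediate token.
        p_after = ntp(list(context) + [tok])
        for j, p in enumerate(p_after):
            marginal[j] += p_next[tok] * p
    # Renormalize so a truncated sum is still a probability distribution.
    total = sum(marginal)
    return [m / total for m in marginal]


context = [0]
exact = second_token_marginal(toy_ntp, context)            # full sum over the vocabulary
approx = second_token_marginal(toy_ntp, context, top_k=2)  # truncated sum
```

With the full sum, the result is the exact two-step distribution under the toy model. The truncated version trades accuracy for fewer model calls, which matters because a real vocabulary has tens of thousands of tokens.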

Taxonomy

Topics: Natural Language Processing Techniques