Multi-Token Prediction via Self-Distillation

John Kirchenbauer; Abhimanyu Hans; Brian Bartoldson; Micah Goldblum; Ashwinee Panda; Tom Goldstein

arXiv:2602.06019·cs.CL·April 27, 2026

TL;DR

This paper introduces a simple online distillation method to convert pretrained autoregressive language models into faster multi-token predictors without extra inference components.

Contribution

The paper presents an approach that transforms existing pretrained models into multi-token predictors with minimal accuracy loss and no additional inference complexity.

Findings

01

Models decode over 3 times faster.

02

Accuracy on GSM8K drops by less than 5% relative to single-token decoding with the same checkpoint.

03

No auxiliary verifier needed for deployment.

Abstract

Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. Our method produces models that decode more than $3 \times$ faster at $< 5\%$ drop in accuracy on GSM8K relative to the single-token decoding performance of the same checkpoint.
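
To give a rough sense of why emitting several tokens per forward pass speeds up decoding, the sketch below counts forward passes under a deliberately simplified assumption: every pass emits exactly `tokens_per_pass` tokens and costs the same as a single-token pass. The function names and the example numbers (256 tokens, 4 tokens per pass) are hypothetical illustrations, not values or code from the paper, and the result is a pass count, not the decoding speedup the abstract reports.

```python
import math


def forward_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Number of forward passes needed to emit num_tokens tokens.

    Assumption (not from the paper): each pass emits exactly
    tokens_per_pass tokens, except possibly the last one.
    """
    if num_tokens < 1 or tokens_per_pass < 1:
        raise ValueError("num_tokens and tokens_per_pass must be >= 1")
    return math.ceil(num_tokens / tokens_per_pass)


def pass_reduction(num_tokens: int, tokens_per_pass: int) -> float:
    """Ratio of single-token passes to multi-token passes.

    Single-token decoding needs one pass per token, so the baseline
    pass count equals num_tokens.
    """
    return num_tokens / forward_passes(num_tokens, tokens_per_pass)


if __name__ == "__main__":
    # Hypothetical example: a 256-token response, 4 tokens per pass.
    print(forward_passes(256, 1))   # 256 passes with single-token decoding
    print(forward_passes(256, 4))   # 64 passes with 4 tokens per pass
    print(pass_reduction(256, 4))   # 4.0
```

In practice the reduction in passes is only an upper bound on the achievable speedup under these assumptions, since real decoding cost also depends on hardware, batch size, and how many tokens the model actually emits per step.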
