TL;DR
This paper introduces a simple online distillation method to convert pretrained autoregressive language models into faster multi-token predictors without extra inference components.
Contribution
It presents a novel approach that transforms existing models into multi-token predictors with minimal accuracy loss and no additional inference complexity.
Findings
Converted models decode more than 3 times faster than single-token decoding with the same checkpoint.
Accuracy on GSM8K drops by less than 5% relative to single-token decoding.
No auxiliary verifier needed for deployment.
Abstract
Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next-token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. Our method produces models that decode more than 3 times faster at less than a 5% drop in accuracy on GSM8K relative to the single-token decoding performance of the same checkpoint.
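To give a rough sense of where a speedup of this size could come from, the sketch below counts forward passes when a model emits several tokens per call instead of one. It is a back-of-the-envelope illustration, not the paper's method or measurement: the function names and the fixed `tokens_per_pass` parameter are hypothetical, and the calculation assumes every forward pass costs the same and always emits the same number of tokens, which real decoders need not do.

```python
import math


def forward_passes(num_tokens: int, tokens_per_pass: int) -> int:
    """Count model calls needed to emit num_tokens tokens.

    Assumption (not from the paper): each call emits exactly
    tokens_per_pass tokens, except possibly a shorter final call.
    """
    if num_tokens < 1 or tokens_per_pass < 1:
        raise ValueError("num_tokens and tokens_per_pass must be >= 1")
    return math.ceil(num_tokens / tokens_per_pass)


def idealized_speedup(num_tokens: int, tokens_per_pass: int) -> float:
    """Ratio of single-token passes to multi-token passes.

    Assumption: all forward passes have equal cost, so the ratio of
    pass counts stands in for the ratio of wall-clock times.
    """
    baseline = forward_passes(num_tokens, 1)
    multi = forward_passes(num_tokens, tokens_per_pass)
    return baseline / multi


if __name__ == "__main__":
    # A 300-token answer, decoded one token vs. three tokens per call.
    print(forward_passes(300, 1), forward_passes(300, 3))
    print(idealized_speedup(300, 3))
```

Under these simplifying assumptions, emitting three tokens per call cuts the number of calls to a third. The measured speedup of a real model depends on per-call cost and on how many tokens each call actually produces.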
