DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling
Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin

TL;DR
DynaMo is a suite of multi-token prediction language models that predict several tokens at once whenever they are confident in the predicted joint probability distribution, reducing inference time while maintaining text quality.
Contribution
The paper presents a dynamic multi-token prediction framework that combines confidence-based sampling with a lightweight training technique that reuses the weights of traditional autoregressive models to speed up language model inference.
Findings
DynaMo-7.3B-T3 achieves 2.57× faster inference than the baseline.
Maintains the same text quality as autoregressive models.
Introduces two methods, co-occurrence weighted masking and adaptive thresholding, that improve the estimated joint probability and thereby the quality of generated text.
Abstract
Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite,…
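The confidence-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each position has its own predicted probability vector, approximates the joint probability as the product of the chosen per-position probabilities (an independence assumption), and uses a fixed threshold. The function name `accept_tokens` and its parameters are hypothetical, and the paper's co-occurrence weighted masking and adaptive thresholding are not reproduced here.

```python
from typing import List, Sequence


def accept_tokens(head_probs: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Pick the most likely token at each successive position while the
    estimated joint probability stays at or above `threshold`.

    Assumption (not from the paper): the joint probability is approximated
    by the product of the chosen per-position probabilities.
    """
    accepted: List[int] = []
    joint = 1.0
    for position, probs in enumerate(head_probs):
        # Greedy choice: index of the highest-probability token.
        token = max(range(len(probs)), key=probs.__getitem__)
        joint *= probs[token]
        # The first token is always emitted so that decoding never stalls;
        # later tokens are kept only while the joint estimate is confident.
        if position > 0 and joint < threshold:
            break
        accepted.append(token)
    return accepted


# Toy distributions over a two-token vocabulary for three positions.
toy_probs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
two_tokens = accept_tokens(toy_probs, threshold=0.5)
one_token = accept_tokens(toy_probs, threshold=0.8)
```

With the lower threshold the sketch emits two tokens in one step, and with the higher threshold it falls back to ordinary one-token decoding. This is the sense in which the number of tokens generated per step is dynamic.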
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
