DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling

Shikhar Tuli; Chi-Heng Lin; Yen-Chang Hsu; Niraj K. Jha; Yilin Shen; Hongxia Jin

arXiv:2405.00888·cs.CL·May 3, 2024

TL;DR

DynaMo is a suite of multi-token prediction language models that reduce inference time by predicting several tokens at once when the model is confident in their joint probability, while aiming to preserve the quality of the generated text.

Contribution

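The paper presents a dynamic multi-token prediction framework that decides how many tokens to predict from the model's confidence in the predicted joint probability distribution, together with a lightweight training technique that reuses the weights of existing autoregressive models.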

Findings

01

DynaMo-7.3B-T3 achieves 2.57× faster inference than baseline.

02

Maintains the same text quality as autoregressive models.

03

Introduces co-occurrence weighted masking and adaptive thresholding to improve the estimated joint probability and, with it, text generation quality.

Abstract

Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models dynamically predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite,…
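
The idea of predicting a variable number of tokens based on confidence in the joint probability can be illustrated with a minimal sketch. It is not the paper's exact procedure: the function name `accept_tokens`, the fixed `threshold`, and the approximation of the joint probability as a running product of per-position probabilities are all assumptions made here for illustration. Co-occurrence weighted masking and adaptive thresholding are not modeled.

```python
from typing import Sequence


def accept_tokens(step_probs: Sequence[float], threshold: float = 0.5) -> int:
    """Return how many tokens to emit from one multi-token prediction step.

    step_probs[i] is a hypothetical model probability for the chosen token at
    the i-th future position. The joint probability is approximated (an
    assumption of this sketch) as the running product of these values.
    The first token is always emitted, as in ordinary autoregressive decoding;
    each further token is emitted only while the joint stays above threshold.
    """
    if not step_probs:
        raise ValueError("step_probs must contain at least one probability")

    accepted = 1
    joint = step_probs[0]
    for p in step_probs[1:]:
        joint *= p
        if joint < threshold:
            break  # confidence too low: back off to fewer tokens
        accepted += 1
    return accepted


if __name__ == "__main__":
    # Confident in the first two positions, not in the third.
    print(accept_tokens([0.9, 0.8, 0.3], threshold=0.5))
```

In this sketch, a higher threshold makes the decoder fall back to single-token prediction more often, trading speed for caution; a lower threshold emits more tokens per step.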

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis