The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
Lawrence Stewart (SIERRA), Matthew Trager, Sujan Kumar Gonugondla, Stefano Soatto (UCLA-CS)

TL;DR
This paper introduces a learning-free speculative decoding method using N-gram strategies to accelerate autoregressive language model inference, achieving significant speedups with minimal overhead.
Contribution
The paper demonstrates that simple, learning-free N-gram-based strategies can effectively accelerate autoregressive inference without modifying the base model.
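As a rough illustration of a learning-free draft strategy, the sketch below collects N-gram statistics from the context alone and proposes the top-k most frequent continuations of the last n-1 tokens. It is a minimal sketch that assumes integer token IDs. `build_context_ngrams` and `top_k_next` are hypothetical names, not the authors' code. The paper also draws N-grams from the model weights, which this sketch does not cover.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

# Hypothetical helpers (not from the paper): N-gram drafting from the context only.
NgramTable = Dict[Tuple[int, ...], Counter]


def build_context_ngrams(tokens: Sequence[int], n: int = 2) -> NgramTable:
    """Count which token follows each (n-1)-token prefix in the context."""
    table: NgramTable = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i : i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table


def top_k_next(table: NgramTable, tokens: Sequence[int], n: int = 2, k: int = 3) -> List[int]:
    """Return up to k most frequent continuations of the last (n-1) tokens."""
    if len(tokens) < n - 1:
        return []  # not enough history to form a prefix
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    # .get avoids inserting empty entries into the defaultdict
    return [tok for tok, _ in table.get(prefix, Counter()).most_common(k)]


context = [5, 7, 9, 5, 7, 3, 5, 7, 9, 5]
table = build_context_ngrams(context, n=2)
print(top_k_next(table, context, n=2, k=3))  # [7]
```

Returning several candidates rather than only the single most frequent one reflects the observation that the base model's next token is often among the top few N-gram predictions, even when it is not the top one.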
Findings
Achieves substantial inference speedups across various tasks.
Performance comparable to more complex methods, without expensive preprocessing.
Easy integration into existing pipelines.
Abstract
Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely N-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-k predictions for small k. Based on this, we show that combinations of simple strategies can achieve significant inference speedups across different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless 'plug-and-play' integration into pipelines.
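The verification side of speculative decoding can be sketched in the same spirit. Assuming greedy decoding, a draft token is accepted only if it equals the base model's own greedy choice given everything before it, so the output matches plain greedy generation. In this sketch, `model` is a hypothetical callable standing in for the base model's argmax. Several drafts (for instance, one per top-k candidate) are checked, the longest accepted prefix is kept, and the base model contributes one extra token. A real implementation would score all drafts in a single batched forward pass rather than calling the model token by token.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the base model: maps a token prefix to its greedy next token.
GreedyModel = Callable[[Sequence[int]], int]


def accepted_length(model: GreedyModel, context: Sequence[int], draft: Sequence[int]) -> int:
    """Number of leading draft tokens that match the base model's greedy choices."""
    n = 0
    while n < len(draft) and model(list(context) + list(draft[:n])) == draft[n]:
        n += 1
    return n


def speculative_step(model: GreedyModel, context: Sequence[int], drafts: List[List[int]]) -> List[int]:
    """Verify several drafts, keep the longest accepted prefix, then add one model token.

    A real implementation would score all drafts in one batched forward pass;
    here each call to `model` is sequential for clarity.
    """
    best: List[int] = []
    for draft in drafts:
        n = accepted_length(model, context, draft)
        if n > len(best):  # first draft reaching the maximum wins ties
            best = list(draft[:n])
    # The base model always contributes one token after the accepted prefix,
    # so each step makes progress even if every draft is rejected.
    bonus = model(list(context) + best)
    return best + [bonus]


# Toy "model" that always predicts (last token + 1) mod 10.
toy_model: GreedyModel = lambda seq: (seq[-1] + 1) % 10
step = speculative_step(toy_model, [1, 2, 3], [[4, 5, 9], [4, 6], [7]])
print(step)  # [4, 5, 6]
```

Here the first draft is accepted for two tokens and the model supplies a third, so one verification step yields three tokens. With cheap N-gram drafts, proposing several candidates per step raises the chance that at least one draft has a long accepted prefix.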
