The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation

Lawrence Stewart (SIERRA); Matthew Trager; Sujan Kumar Gonugondla; Stefano Soatto (UCLA-CS)

arXiv:2411.03786·cs.LG·November 7, 2024

Open Access

TL;DR

This paper introduces a learning-free speculative decoding method using N-gram strategies to accelerate autoregressive language model inference, achieving significant speedups with minimal overhead.

Contribution

It demonstrates that simple, learning-free N-gram based strategies can effectively accelerate autoregressive inference without modifying the base model.

Findings

01

Achieves substantial inference speedups across various tasks.

02

Overall performance is comparable to more complex methods, without requiring expensive preprocessing or modification of the base model.

03

Easy integration into existing pipelines.

Abstract

Speculative decoding aims to speed up autoregressive generation of a language model by verifying in parallel the tokens generated by a smaller draft model. In this work, we explore the effectiveness of learning-free, negligible-cost draft strategies, namely $N$-grams obtained from the model weights and the context. While the predicted next token of the base model is rarely the top prediction of these simple strategies, we observe that it is often within their top-$k$ predictions for small $k$. Based on this, we show that combinations of simple strategies can achieve significant inference speedups over different tasks. The overall performance is comparable to more complex methods, yet does not require expensive preprocessing or modification of the base model, and allows for seamless "plug-and-play" integration into pipelines.
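To make the idea in the abstract concrete, here is a minimal sketch of batched speculation with a context N-gram drafter. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_context_ngrams`, `propose_drafts`, `verify`, `generate`), the toy base model `toy_next_token`, and the greedy acceptance rule are all hypothetical. It covers only N-grams taken from the context, not the ones the paper derives from model weights, and it simulates the parallel verification step with a plain loop.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]  # hypothetical greedy base-model interface


def build_context_ngrams(tokens: Sequence[int], n: int) -> Dict[Tuple[int, ...], Counter]:
    """Count which token follows each (n-1)-token prefix in the context (assumes n >= 2)."""
    table: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        prefix = tuple(tokens[i:i + n - 1])
        table[prefix][tokens[i + n - 1]] += 1
    return table


def propose_drafts(tokens: Sequence[int], table: Dict[Tuple[int, ...], Counter],
                   n: int, k: int, draft_len: int) -> List[List[int]]:
    """Build up to k drafts: one per top-k first token, each extended greedily by the N-gram table."""
    prefix = tuple(tokens[-(n - 1):])
    drafts: List[List[int]] = []
    for first, _count in table.get(prefix, Counter()).most_common(k):
        draft = [first]
        while len(draft) < draft_len:
            ctx = tuple((list(tokens) + draft)[-(n - 1):])
            followers = table.get(ctx)
            if not followers:
                break
            draft.append(followers.most_common(1)[0][0])
        drafts.append(draft)
    return drafts


def verify(tokens: Sequence[int], drafts: List[List[int]], next_token: NextToken) -> List[int]:
    """Keep the longest draft prefix that matches the base model's greedy output,
    then append one token from the base model itself.
    A real system would score all drafts in one batched forward pass;
    this loop only simulates that step."""
    best: List[int] = []
    for draft in drafts:
        accepted: List[int] = []
        for tok in draft:
            if next_token(list(tokens) + accepted) != tok:
                break
            accepted.append(tok)
        if len(accepted) > len(best):
            best = accepted
    return best + [next_token(list(tokens) + best)]


def generate(prompt: Sequence[int], next_token: NextToken, max_new: int,
             n: int = 2, k: int = 2, draft_len: int = 4) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return the number of verification steps used."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new:
        table = build_context_ngrams(tokens, n)
        drafts = propose_drafts(tokens, table, n, k, draft_len)
        tokens.extend(verify(tokens, drafts, next_token))
        steps += 1
    return tokens[:len(prompt) + max_new], steps


def toy_next_token(tokens: Sequence[int]) -> int:
    """Toy stand-in for a base model: counts upward modulo 5."""
    return (tokens[-1] + 1) % 5
```

Because every accepted token is checked against the base model's own greedy choice, the output in this sketch is identical to plain greedy decoding; the saving comes from needing fewer verification steps than generated tokens whenever a draft is accepted. Trying `k` first tokens instead of one reflects the abstract's observation that the base model's next token is often within the drafter's top-$k$ even when it is not the top prediction.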
