Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Ethan Shen; Alan Fan; Sarah M. Pratt; Jae Sung Park; Matthew Wallingford; Sham M. Kakade; Ari Holtzman; Ranjay Krishna; Ali Farhadi; Aditya Kusupati

arXiv:2405.18400·cs.CL·November 1, 2024·1 cites

TL;DR

Superposed Decoding generates $k$ drafts at the computation cost of a single autoregressive inference pass instead of $k$ passes, while keeping the drafts as coherent and factual as those from existing sampling methods.

Contribution

The paper introduces Superposed Decoding, a decoding algorithm that produces $k$ drafts in one autoregressive inference pass by feeding the language model a superposition of the drafts' most recent token embeddings.

Findings

01. At least 2.44× faster for k≥3 drafts.
02. Drafts are as coherent and factual as existing sampling methods.
03. User evaluations favor Superposed Decoding over Nucleus Sampling.

Abstract

Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^{2}$ new drafts and cache the $k$ most likely…
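
The two operations the abstract names (superposing the latest token embeddings, then expanding $k$ drafts into $k^{2}$ candidates and keeping $k$) can be illustrated with a small sketch. This is not the authors' implementation: the helper names `superpose` and `expand_and_prune` are hypothetical, the superposition is assumed to be a weighted sum, and candidates are ranked only by draft log-probability plus token log-probability, which may differ from how the paper scores drafts.

```python
import heapq
import math


def superpose(embeddings, weights):
    """Assumed form of superposition: a weighted sum of k embedding vectors.

    embeddings: list of k vectors (lists of floats), one per draft's latest token.
    weights:    list of k floats (hypothetical; e.g. normalized draft probabilities).
    """
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]


def expand_and_prune(drafts, top_tokens, k):
    """Combine k drafts with the top-k next tokens and keep the k best candidates.

    drafts:     list of (token_tuple, log_prob) pairs.
    top_tokens: list of (token, log_prob) pairs from the single model pass.
    Scoring by summed log-probabilities is an assumption made for this sketch.
    """
    candidates = [
        (draft_tokens + (token,), draft_lp + token_lp)
        for draft_tokens, draft_lp in drafts
        for token, token_lp in top_tokens
    ]  # k * k candidates
    return heapq.nlargest(k, candidates, key=lambda c: c[1])


# Toy example with k = 2 (all values are made up for illustration).
drafts = [(("the",), math.log(0.6)), (("a",), math.log(0.4))]
top_tokens = [("cat", math.log(0.7)), ("dog", math.log(0.3))]

model_input = superpose([[1.0, 0.0], [0.0, 1.0]], [0.6, 0.4])
kept = expand_and_prune(drafts, top_tokens, k=2)
for tokens, log_prob in kept:
    print(" ".join(tokens), round(math.exp(log_prob), 2))
```

In this toy run the four candidates have probabilities 0.42, 0.18, 0.28, and 0.12, so the two kept drafts are "the cat" and "a cat". The point of the design is that only one model call per step is needed to score all $k$ drafts, since the model sees a single superposed input rather than $k$ separate ones.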

Code & Models

Repositories

raivnlab/superposeddecoding (PyTorch, official)

Taxonomy

Topics: Algorithms and Data Compression