Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Ethan Shen, Alan Fan, Sarah M. Pratt, Jae Sung Park, Matthew Wallingford, Sham M. Kakade, Ari Holtzman, Ranjay Krishna, Ali Farhadi, Aditya Kusupati

TL;DR
Superposed Decoding generates multiple drafts from a single language model inference pass, cutting computational cost while keeping quality, coherence, and factual accuracy on par with traditional decoding methods.
Contribution
The paper introduces Superposed Decoding, a novel algorithm that produces multiple drafts simultaneously during one inference pass, reducing computational costs without sacrificing quality.
Findings
At least 2.44× faster for k≥3 drafts.
Drafts are at least as coherent and factual as those from Nucleus Sampling and Greedy Decoding, respectively.
User evaluations favor Superposed Decoding over Nucleus Sampling.
Abstract
Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing k drafts to the user requires running an expensive language model k times. To alleviate the computation cost of running k inference passes, we propose Superposed Decoding, a new decoding algorithm that generates k drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the k drafts as input to the next decoding step of the language model. At every inference step we combine the k drafts with the top-k tokens to get k² new drafts and cache the k most likely…
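The two steps the abstract describes (superposing the drafts' latest token embeddings, then expanding k drafts with the top-k tokens and keeping the k best) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `superpose` and `expand_and_prune`, the normalized weighting of embeddings, and the scoring rule (draft log-probability plus token log-probability from one shared next-token distribution) are all assumptions made for illustration, and the embeddings and probabilities are invented toy values.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Draft = Tuple[Tuple[str, ...], float]  # (tokens, cumulative log-probability)


def superpose(embeddings: List[Vector], weights: List[float]) -> Vector:
    """Weighted sum of the drafts' most recent token embeddings.

    Assumption: weights are normalized to sum to 1 before mixing.
    """
    total = sum(weights)
    dim = len(embeddings[0])
    return [
        sum((w / total) * e[i] for w, e in zip(weights, embeddings))
        for i in range(dim)
    ]


def expand_and_prune(
    drafts: List[Draft], next_logprobs: Dict[str, float], k: int
) -> List[Draft]:
    """Combine each draft with the top-k tokens, then keep the k best candidates.

    Assumption: a candidate's score is the draft's log-probability plus the
    token's log-probability under a single shared next-token distribution.
    """
    top_tokens = sorted(next_logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    candidates = [
        (tokens + (tok,), lp + tok_lp)
        for tokens, lp in drafts
        for tok, tok_lp in top_tokens
    ]  # k drafts x k tokens = k*k candidates
    candidates.sort(key=lambda d: d[1], reverse=True)
    return candidates[:k]


# Toy usage with k = 2 (all values are made up for illustration).
mixed = superpose([[1.0, 0.0], [0.0, 1.0]], [0.75, 0.25])
drafts = [(("the",), math.log(0.6)), (("a",), math.log(0.4))]
next_logprobs = {"cat": math.log(0.5), "dog": math.log(0.3), "sky": math.log(0.2)}
new_drafts = expand_and_prune(drafts, next_logprobs, k=2)
```

In this toy run the four candidates are "the cat", "the dog", "a cat", and "a dog", and the two highest-scoring ones are kept as the drafts for the next step. In a real model, `mixed` would be the single input fed to the next decoding step in place of one token embedding per draft.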
Taxonomy
Topics: Algorithms and Data Compression
