Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

Tian Jin; Ellie Y. Cheng; Zack Ankner; Nikunj Saunshi; Blake M. Elias; Amir Yazdanbakhsh; Jonathan Ragan-Kelley; Suvinay Subramanian; Michael Carbin

arXiv:2502.11517·cs.CL·February 24, 2025

TL;DR

This paper introduces PASTA, a learning-based system that teaches large language models to identify semantically independent parts of their own responses and decode them in parallel, speeding up decoding by up to 1.93x while keeping response quality comparable to the baseline.

Contribution

PASTA contributes an annotation language (PASTA-LANG), an interpreter for it, and a finetuning method that together let LLMs mark semantically independent chunks of their own responses so those chunks can be decoded in parallel.

Findings

01. Achieves up to a 1.93x decoding speedup

02. Maintains response quality comparable to the baseline

03. Outperforms existing parallel decoding methods

Abstract

Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process,…
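
To make the interpreter idea concrete, here is a minimal Python sketch. It is not PASTA's implementation: the `[[async: ...]]` placeholder syntax, `decode_chunk`, `interpret`, and `run` are all hypothetical names invented for illustration, and this page does not show the actual PASTA-LANG syntax. The sketch also assumes the annotated skeleton is available up front, whereas the abstract describes orchestration on-the-fly during inference.

```python
import asyncio
import re

# Hypothetical annotation syntax, invented for this sketch (not PASTA-LANG):
# [[async: topic]] marks a chunk that can be decoded independently.
PLACEHOLDER = re.compile(r"\[\[async:\s*(.*?)\]\]")


async def decode_chunk(topic: str) -> str:
    """Stand-in for a model call that decodes one independent chunk."""
    await asyncio.sleep(0.01)  # simulates decoding latency
    return f"<text about {topic}>"


async def interpret(annotated: str) -> str:
    """Decode all annotated chunks concurrently, then splice them back in order."""
    topics = PLACEHOLDER.findall(annotated)
    # gather runs the chunk decoders concurrently and preserves input order.
    chunks = await asyncio.gather(*(decode_chunk(t) for t in topics))
    filled = iter(chunks)
    return PLACEHOLDER.sub(lambda _match: next(filled), annotated)


def run(annotated: str) -> str:
    """Synchronous convenience wrapper around interpret()."""
    return asyncio.run(interpret(annotated))


if __name__ == "__main__":
    print(run("Pros: [[async: speed]] Cons: [[async: cost]]"))
    # prints: Pros: <text about speed> Cons: <text about cost>
```

Because the two placeholders are decoded concurrently, the simulated wall-clock cost is roughly one chunk's latency rather than the sum of both, which is the source of the speedup that parallel decoding aims for.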

Videos

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding · SlidesLive

Taxonomy

Topics: Natural Language Processing Techniques · Neural Networks and Applications · Machine Learning and Algorithms