Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

Seongjun Yang; Gibbeum Lee; Jaewoong Cho; Dimitris Papailiopoulos; Kangwook Lee

arXiv:2307.05908·cs.CL·July 30, 2024·1 cites

Open Access

TL;DR

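This paper introduces Predictive Pipelined Decoding (PPD), a method that speeds up greedy decoding in Large Language Models while producing exactly the same output. PPD spends additional compute to begin decoding subsequent tokens in parallel while the current token is still being decoded, and a theoretical framework analyzes the resulting trade-off between computation and latency.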

Contribution

The paper proposes PPD, a decoding approach that uses additional compute to reduce latency while producing the same output as standard greedy decoding, supported by a theoretical analysis and preliminary experiments.

Findings

01. PPD reduces decoding latency in LLMs without changing the output of greedy decoding.

02. A theoretical framework estimates the potential latency reduction from the match rate, p_correct.

03. Preliminary experiments validate PPD's effectiveness.

Abstract

This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD employs additional compute resources to parallelize the initiation of subsequent token decoding during the current token decoding. This method reduces decoding latency and reshapes the understanding of trade-offs in LLM decoding strategies. We have developed a theoretical framework that allows us to analyze the trade-off between computation and latency. Using this framework, we can analytically estimate the potential reduction in latency associated with our proposed method, achieved through the assessment of the match rate, represented as p_correct. The results demonstrate that the use of extra computational resources has the potential to accelerate LLM…
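To make the role of the match rate concrete, here is a minimal toy sketch in Python. It is not the paper's theoretical framework: it assumes, purely for illustration, that latency is counted in layer-steps, that a guess for the next token is made at a hypothetical intermediate layer `early_layer` of a `depth`-layer model, and that a correct guess (probability `p_correct`) saves exactly `depth - early_layer` layer-steps on the next token while a wrong guess saves nothing. All function and parameter names other than `p_correct` are invented for this example.

```python
import random


def expected_latency_per_token(depth: int, early_layer: int, p_correct: float) -> float:
    """Expected layer-steps per token under the toy model.

    Assumption (not taken from the paper): a correct early guess saves exactly
    (depth - early_layer) layer-steps; a wrong guess saves nothing.
    """
    if not 0 < early_layer <= depth:
        raise ValueError("early_layer must be in (0, depth]")
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be in [0, 1]")
    return depth - p_correct * (depth - early_layer)


def simulate_total_latency(
    num_tokens: int, depth: int, early_layer: int, p_correct: float, seed: int = 0
) -> int:
    """Simulate total layer-steps for num_tokens tokens under the same toy model."""
    if num_tokens < 1:
        raise ValueError("num_tokens must be at least 1")
    rng = random.Random(seed)
    total = depth  # the first token has no earlier token to overlap with
    for _ in range(num_tokens - 1):
        matched = rng.random() < p_correct  # did the early guess match?
        # On a match the head start leaves early_layer steps; otherwise full depth.
        total += early_layer if matched else depth
    return total


if __name__ == "__main__":
    # Hypothetical numbers chosen only to exercise the functions.
    for p in (0.0, 0.5, 1.0):
        per_token = expected_latency_per_token(depth=32, early_layer=16, p_correct=p)
        print(f"p_correct={p:.1f}: expected layer-steps per token = {per_token}")
```

In this toy model the saving grows linearly with `p_correct`, which is why the match rate is the natural quantity to measure. The sketch deliberately ignores the compute side of the trade-off: the speculative work started in parallel is additional computation, and any work started on a wrong guess is wasted.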

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis