PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
Branden Butler, Sixing Yu, Arya Mazaheri, and Ali Jannesari

TL;DR
PipeInfer introduces a pipelined speculative inference method for LLMs that reduces latency and improves throughput by running multiple speculative inferences concurrently and skipping invalidated computations.
Contribution
The paper proposes PipeInfer, a pipelined speculative inference technique that improves LLM inference speed and system utilization, especially in single-request scenarios.
Findings
Up to 2.15× faster generation than standard speculative inference
Effective in low-bandwidth and low-speculation acceptance scenarios
Reduces inter-token latency significantly
Abstract
Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. Combined with a variable rate of acceptance across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects.…
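The abstract's point about acceptance rates can be illustrated with a minimal sketch of the verification step in generic greedy speculative decoding. This is not PipeInfer's pipelined scheduler, and the names `verify_draft` and `toy_target_next` are hypothetical, introduced only for this illustration. A cheap draft model proposes several tokens, and the target model keeps the longest prefix it agrees with plus one token of its own. For readability the sketch queries the target once per position, whereas real systems typically score all draft positions in one batched forward pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    context: Sequence[int],
    draft_tokens: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens produced by one speculative run (greedy acceptance).

    Draft tokens are accepted while they match the target model's greedy
    choice. The first mismatch is replaced by the target's token and the
    rest of the draft is discarded. If every draft token matches, the
    target contributes one extra token.
    """
    produced: List[int] = []
    ctx = list(context)
    for tok in draft_tokens:
        expected = target_next(ctx)
        if tok != expected:
            produced.append(expected)  # target's correction ends the run
            return produced
        produced.append(tok)
        ctx.append(tok)
    produced.append(target_next(ctx))  # bonus token after full acceptance
    return produced


def toy_target_next(ctx: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


if __name__ == "__main__":
    # The draft diverges at its third token, so only two draft tokens survive.
    print(verify_draft([1, 2, 3], [4, 5, 7, 8], toy_target_next))
```

Every run yields at least one token, but when few draft tokens are accepted the work spent on the rejected ones is wasted. This is why the abstract says such techniques need high acceptance rates to pay off.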
Taxonomy
Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Internet Traffic Analysis and Secure E-voting
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
