PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation

Branden Butler, Sixing Yu, Arya Mazaheri, and Ali Jannesari

arXiv:2407.11798 · cs.CL · November 19, 2024

Open Access

TL;DR

PipeInfer introduces a pipelined speculative inference method for LLMs that reduces latency and improves throughput by running multiple speculative inferences concurrently and skipping invalidated computations.

Contribution

The paper proposes PipeInfer, a pipelined speculative inference technique that improves LLM inference speed and system utilization, especially in single-request scenarios.

Findings

1. Up to 2.15× speedup in generation speed over standard speculative inference

2. Effective in low-bandwidth and low-speculation-acceptance scenarios

3. Reduces inter-token latency significantly

Abstract

Inference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. Combined with a variable rate of acceptance across tasks, speculative inference techniques can result in reduced performance. Additionally, pipeline-parallel designs require many user requests to maintain maximum utilization. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects.…
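
The abstract's point that speculative techniques need high acceptance rates can be illustrated with a short calculation. The sketch below is not taken from the paper; it uses the common simplifying assumption that each drafted token is accepted independently with the same probability, and the function name and parameters are hypothetical. Under that assumption, one verification run of the target model emits on average 1 + a + a^2 + ... + a^k tokens for acceptance rate a and draft length k.

```python
def expected_tokens_per_run(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model verification run.

    Assumption (not from the paper): every drafted token is accepted
    independently with probability `acceptance_rate`.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must lie in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    # The i-th term is the probability that the first i drafted tokens are all
    # accepted. The i = 0 term (always 1) accounts for the token the target
    # model produces itself, even when every drafted token is rejected.
    return sum(acceptance_rate ** i for i in range(draft_length + 1))


if __name__ == "__main__":
    for rate in (0.2, 0.5, 0.8):
        tokens = expected_tokens_per_run(rate, draft_length=4)
        print(f"acceptance rate {rate:.1f}: {tokens:.3f} tokens per run")
```

With a low acceptance rate the expected yield stays close to one token per run, so the extra latency of drafting and verifying is hard to amortize. This is the regime the abstract says PipeInfer aims to tolerate better.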

Taxonomy

Topics: Advanced Data Storage Technologies · Algorithms and Data Compression · Internet Traffic Analysis and Secure E-voting

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings