SPEED: Speculative Pipelined Execution for Efficient Decoding

Coleman Hooper; Sehoon Kim; Hiva Mohammadzadeh; Hasan Genc; Kurt Keutzer; Amir Gholami; Sophia Shao

arXiv:2310.12072 · cs.CL · January 4, 2024 · 1 citation

Open Access

TL;DR

SPEED is a speculative pipelined execution method that accelerates generative LLM inference: it executes multiple future tokens in parallel with the current token, using values predicted from early-layer hidden states, which reduces latency while maintaining accuracy.

Contribution

The paper presents SPEED, a novel approach that enables parallel speculative execution of tokens in Transformer-based LLMs, improving inference speed without significant accuracy loss.

Findings

01 Significant latency reduction in LLM inference.

02 Effective parallelization of token generation using speculation.

03 Ability to train deeper decoders with minimal overhead.

Abstract

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter…
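
The speculate-then-verify idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `full_next` and `early_guess` are hypothetical stand-ins for a full-depth decoder pass and an early-layer prediction, and the loop looks ahead by a single token and runs sequentially, whereas the abstract describes multiple future tokens executing in parallel. The sketch only shows that, assuming mispredicted speculation is discarded, the output matches ordinary greedy decoding.

```python
# Toy sketch of speculate-then-verify decoding (illustrative assumptions only).
# full_next and early_guess are hypothetical stand-ins, not the SPEED model.

VOCAB = 8

def full_next(prefix):
    """Stand-in for the full-depth model's greedy next token."""
    return (sum(prefix) * 3 + len(prefix)) % VOCAB

def early_guess(prefix):
    """Stand-in for an early-layer prediction; deliberately wrong
    whenever the prefix length is a multiple of 3."""
    tok = full_next(prefix)
    return (tok + 1) % VOCAB if len(prefix) % 3 == 0 else tok

def greedy_decode(prompt, n_new):
    """Baseline: one full-depth pass per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(full_next(seq))
    return seq

def speculative_decode(prompt, n_new):
    """Start work on the following token from an early guess,
    then keep that work only if the guess turns out to be correct."""
    seq = list(prompt)
    target = len(prompt) + n_new
    accepted = rejected = 0
    while len(seq) < target:
        guess = early_guess(seq)         # available before the full pass ends
        spec = full_next(seq + [guess])  # speculative work for the next position
        true_tok = full_next(seq)        # full-depth result for this position
        seq.append(true_tok)
        if guess == true_tok:
            accepted += 1
            if len(seq) < target:
                seq.append(spec)         # guess was right: reuse speculative work
        else:
            rejected += 1                # guess was wrong: discard speculative work
    return seq, accepted, rejected

if __name__ == "__main__":
    out, ok, bad = speculative_decode([1, 2, 3], 10)
    print(out == greedy_decode([1, 2, 3], 10), ok, bad)
```

In a real pipeline the speculative and full-depth computations would overlap in time; here they run one after another, so the sketch demonstrates correctness of the accept/discard logic rather than any speedup.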

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Residual Connection · Absolute Position Encodings · Layer Normalization · Softmax · Adam · Byte Pair Encoding