SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

TL;DR
SPEED is a speculative pipelined execution method that accelerates generative LLM inference: it runs multiple future tokens in parallel with the current token, using values predicted from early-layer hidden states, to reduce latency while maintaining accuracy.
Contribution
The paper presents SPEED, a novel approach that enables parallel speculative execution of tokens in Transformer-based LLMs, improving inference speed without significant accuracy loss.
Findings
Significant latency reduction in LLM inference.
Effective parallelization of token generation using speculation.
Ability to train deeper decoders with minimal overhead.
Abstract
Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios has been highly restricted due to the significant inference latency associated with these models. This is particularly pronounced due to the autoregressive nature of generative LLM inference, where tokens are generated sequentially since each token depends on all previous output tokens. It is therefore challenging to achieve any token-level parallelism, making inference extremely memory-bound. In this work, we propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token using predicted values based on early-layer hidden states. For Transformer decoders that employ parameter…
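The core idea in the abstract (guess the next token from early-layer hidden states, then start working on that guess while the current token finishes the remaining layers) can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the helper names (make_toy_model, run_layers, predict, speculative_decode), the tiny random "model" whose next token depends only on the current token (no attention), the early_exit parameter, and in particular the verify-and-recompute fallback on a wrong guess, which the abstract does not describe. The two calls marked as parallel run one after the other here; a pipelined implementation would overlap them.

```python
import math
import random


def make_toy_model(vocab=16, dim=8, n_layers=6, seed=0):
    """Build a tiny random stand-in for a decoder (hypothetical, not SPEED's model)."""
    rng = random.Random(seed)

    def mat(rows, cols, scale=1.0):
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": mat(vocab, dim),
        "layers": [mat(dim, dim, 1.0 / math.sqrt(dim)) for _ in range(n_layers)],
        "head": mat(dim, vocab),
    }


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def run_layers(model, h, start, stop):
    """Apply layers[start:stop] as simple residual tanh blocks."""
    for w in model["layers"][start:stop]:
        h = [x + math.tanh(y) for x, y in zip(h, vec_mat(h, w))]
    return h


def predict(model, h):
    """Greedy token choice from a hidden state."""
    logits = vec_mat(h, model["head"])
    return max(range(len(logits)), key=logits.__getitem__)


def sequential_decode(model, first_token, steps):
    """Baseline: each token runs through every layer before the next one starts."""
    n = len(model["layers"])
    tokens = [first_token]
    for _ in range(steps):
        h = run_layers(model, model["embed"][tokens[-1]], 0, n)
        tokens.append(predict(model, h))
    return tokens


def speculative_decode(model, first_token, steps, early_exit):
    """Guess the next token after `early_exit` layers and start on it early."""
    n = len(model["layers"])
    tokens = [first_token]
    hits = 0
    h_early = run_layers(model, model["embed"][first_token], 0, early_exit)
    for _ in range(steps):
        guess = predict(model, h_early)  # prediction from early-layer state
        # These two calls are the ones a pipelined implementation would overlap:
        spec_early = run_layers(model, model["embed"][guess], 0, early_exit)
        h_full = run_layers(model, h_early, early_exit, n)
        actual = predict(model, h_full)
        if actual == guess:
            hits += 1
            h_early = spec_early  # speculative work is reused
        else:
            # Assumed fallback: discard the speculative state and recompute.
            h_early = run_layers(model, model["embed"][actual], 0, early_exit)
        tokens.append(actual)
    return tokens, hits
```

For example, calling speculative_decode(make_toy_model(), 3, 20, early_exit=4) returns the generated tokens together with the number of steps where the early guess matched the full-depth prediction. Because of the assumed fallback, the tokens match sequential_decode on the same inputs; the hit count indicates how often the speculative work could have been reused.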
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Residual Connection · Absolute Position Encodings · Layer Normalization · Softmax · Adam · Byte Pair Encoding
