FIRP: Faster LLM inference via future intermediate representation prediction
Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

TL;DR
FIRP is a speculative decoding method that predicts the intermediate hidden states of future tokens so that an LLM can generate several tokens per decoding step, for a reported 1.9x to 3x inference speedup.
Contribution
The paper introduces FIRP, a novel approach that predicts intermediate hidden states to accelerate LLM inference through parallel decoding.
Findings
Achieves 1.9x to 3x speedup in inference latency
Effectively predicts future hidden states with simple linear transformations
Narrowing the semantic gap improves decoding accuracy
Abstract
Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP, which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens that have not been decoded yet) and then using these pseudo hidden states to decode future tokens. Specifically, these pseudo hidden states are predicted with a simple linear transformation in the intermediate layers of LLMs. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic…
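The mechanism described in the abstract can be illustrated with a minimal toy sketch in pure Python. It is not the authors' implementation: the causal "layer" below is a stand-in for a transformer block, and every name (`toy_causal_layer`, `firp_style_step`, `predictors`, the split layer `k`) is a hypothetical chosen for illustration. The sketch only shows the data flow: run the first `k` layers on the real tokens, derive pseudo hidden states for future positions with one linear map each, append them, and let them pass through the remaining layers before decoding. The verification of drafted tokens that speculative decoding methods normally perform is omitted.

```python
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for trained weights (assumption: untrained toy)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def toy_causal_layer(W, hidden):
    """Stand-in for a transformer block: position i sees only positions <= i."""
    dim = len(hidden[0])
    out = []
    for i in range(len(hidden)):
        # Causal context: mean of all hidden states up to and including i.
        ctx = [sum(h[d] for h in hidden[: i + 1]) / (i + 1) for d in range(dim)]
        update = matvec(W, ctx)
        out.append([a + b for a, b in zip(hidden[i], update)])  # residual add
    return out


def firp_style_step(embeds, layers, predictors, k, unembed):
    """One decoding step that drafts 1 + len(predictors) tokens.

    embeds:     hidden states of the already-decoded tokens
    layers:     weights of the toy layers
    predictors: hypothetical linear maps, one per future position
    k:          index of the intermediate layer where prediction happens
    unembed:    output projection to vocabulary logits
    """
    hidden = embeds
    for W in layers[:k]:
        hidden = toy_causal_layer(W, hidden)

    # Predict pseudo hidden states of future tokens from the last real token.
    pseudo = [matvec(P, hidden[-1]) for P in predictors]
    hidden = hidden + pseudo

    # Pseudo states take part in all following layers alongside the real ones.
    for W in layers[k:]:
        hidden = toy_causal_layer(W, hidden)

    # Decode the last real position and every pseudo position greedily.
    n_new = 1 + len(predictors)
    logits = [matvec(unembed, h) for h in hidden[-n_new:]]
    return [max(range(len(l)), key=l.__getitem__) for l in logits]


rng = random.Random(0)
dim, vocab = 4, 7
demo_layers = [rand_matrix(dim, dim, rng) for _ in range(3)]
demo_predictors = [rand_matrix(dim, dim, rng) for _ in range(2)]
demo_unembed = rand_matrix(vocab, dim, rng)
demo_embeds = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(5)]
print(firp_style_step(demo_embeds, demo_layers, demo_predictors, 1, demo_unembed))
```

Because the toy layer is causal, appending pseudo states does not change the output at the last real position, so the first drafted token matches what ordinary one-token decoding would produce in this sketch; only the extra tokens depend on the quality of the linear predictors.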
Taxonomy
Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis
