FIRP: Faster LLM inference via future intermediate representation prediction

Pengfei Wu; Jiahao Liu; Zhuocheng Gong; Qifan Wang; Jinpeng Li; Jingang Wang; Xunliang Cai; Dongyan Zhao

arXiv:2410.20488·cs.CL·October 29, 2024

Open Access

TL;DR

FIRP is a speculative decoding method that predicts the hidden states of future tokens in LLMs so that several tokens can be generated per decoding step, yielding a reported 1.9x to 3x inference speedup.

Contribution

The paper introduces FIRP, a novel approach that predicts intermediate hidden states to accelerate LLM inference through parallel decoding.

Findings

01

Achieves 1.9x to 3x speedup in inference latency

02

Effectively predicts future hidden states with simple linear transformations

03

Narrowing semantic gap improves decoding accuracy

Abstract

Recent advancements in Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. Despite this, the auto-regressive nature of LLM decoding, which generates only a single token per forward propagation, fails to fully exploit the parallel computational power of GPUs, leading to considerable latency. To address this, we introduce a novel speculative decoding method named FIRP, which generates multiple tokens instead of one at each decoding step. We achieve this by predicting the intermediate hidden states of future tokens (tokens that have not been decoded yet) and then using these pseudo hidden states to decode the future tokens. Specifically, these pseudo hidden states are predicted with a simple linear transformation in the intermediate layers of LLMs. Once predicted, they participate in the computation of all the following layers, thereby assimilating richer semantic…
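The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the layers here are position-wise stand-ins without attention, all weights are random rather than trained, and every name (`toy_layer`, `forward_with_pseudo_states`, `predict_layer`, `projections`) is hypothetical. It only shows the data flow: the current token's hidden state runs through the lower layers, linear maps at an intermediate layer produce pseudo hidden states for future positions, and all states then pass through the remaining layers before being decoded.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def toy_layer(W, x):
    """Stand-in for a transformer layer: residual plus tanh(W x). No attention."""
    return [xi + math.tanh(yi) for xi, yi in zip(x, matvec(W, x))]

def random_matrix(rng, rows, cols, scale=0.1):
    """Random weights; a real model would use trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def forward_with_pseudo_states(h, layers, projections, predict_layer):
    """Run one decoding step and return final hidden states for the current
    position plus one pseudo position per projection matrix."""
    # Lower layers process only the current token's hidden state.
    for W in layers[:predict_layer]:
        h = toy_layer(W, h)
    # Linear transformations predict pseudo hidden states for future positions.
    states = [h] + [matvec(P, h) for P in projections]
    # Real and pseudo states all pass through the remaining layers.
    for W in layers[predict_layer:]:
        states = [toy_layer(W, s) for s in states]
    return states

def greedy_tokens(states, lm_head):
    """Map each final hidden state to its highest-scoring vocabulary index."""
    tokens = []
    for s in states:
        logits = matvec(lm_head, s)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

# Assumed toy sizes, chosen only to keep the example small.
rng = random.Random(0)
d_model, vocab, n_layers, k_future = 8, 16, 6, 2
layers = [random_matrix(rng, d_model, d_model) for _ in range(n_layers)]
projections = [random_matrix(rng, d_model, d_model, scale=1.0) for _ in range(k_future)]
lm_head = random_matrix(rng, vocab, d_model, scale=1.0)
h0 = [rng.uniform(-1, 1) for _ in range(d_model)]

states = forward_with_pseudo_states(h0, layers, projections, predict_layer=3)
draft = greedy_tokens(states, lm_head)
print(len(states), draft)
```

The first entry of `draft` corresponds to the ordinary next token; the remaining entries are drafts decoded from pseudo hidden states. In a speculative decoding setup such drafts still have to be verified by the model before being accepted, a step this sketch omits.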

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques · Speech Recognition and Synthesis