Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
Jingyu Liu, Beidi Chen, Ce Zhang

TL;DR
This paper introduces SpecPrefill, a training-free method that reduces time-to-first-token (TTFT) in large language model inference by selecting only the important prompt tokens, which lowers TTFT and raises achievable QPS.
Contribution
The paper proposes a lightweight, training-free framework that reduces TTFT by using token importance estimation to preselect the prompt tokens the LLM processes, rather than optimizing self-attention alone as much prior work does.
Findings
Achieves up to 7.66× TTFT improvement.
Serves Llama-3.1-405B-Instruct-FP8 with 7× higher QPS.
Demonstrates effectiveness across diverse tasks.
Abstract
Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging since it is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates the inference TTFT for both long and medium context queries based on the following insight: LLMs are generalized enough to preserve the quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then…
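The selection step described in the abstract can be illustrated with a minimal sketch. It assumes per-token importance scores are already available; in the paper they come from a lightweight model, and how they are computed is not shown here. The function name select_important_tokens, the keep_ratio parameter, and the tie-breaking rule are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence, Tuple


def select_important_tokens(
    token_ids: Sequence[int],
    importance: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Keep the highest-scoring prompt tokens and their original positions.

    `importance` is an assumed input: one score per prompt token, e.g. as
    estimated by a smaller model. This sketch does not compute the scores.
    """
    if len(token_ids) != len(importance):
        raise ValueError("token_ids and importance must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n = len(token_ids)
    if n == 0:
        return [], []

    # Number of tokens to keep; always keep at least one.
    k = max(1, math.ceil(n * keep_ratio))

    # Rank positions by score, highest first; ties go to the earlier position.
    ranked = sorted(range(n), key=lambda i: (-importance[i], i))

    # Restore prompt order so the kept tokens stay in their original sequence.
    kept_positions = sorted(ranked[:k])
    kept_tokens = [token_ids[i] for i in kept_positions]
    return kept_tokens, kept_positions


# Toy example with made-up token ids and scores.
tokens, positions = select_important_tokens(
    token_ids=[101, 7, 8, 9, 10, 102],
    importance=[0.9, 0.1, 0.8, 0.05, 0.7, 0.6],
    keep_ratio=0.5,
)
print(tokens, positions)
```

Returning the original positions alongside the kept tokens reflects the abstract's point that the selected tokens are passed on together with their positional information, so the main model would see each token at its original index rather than at a renumbered one.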
Taxonomy
Topics: Advancements in Semiconductor Devices and Circuit Design · Integrated Circuits and Semiconductor Failure Analysis · Thin-Film Transistor Technologies
Methods: Focus · Sparse Evolutionary Training
