Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation

Jingyu Liu; Beidi Chen; Ce Zhang

arXiv:2502.02789·cs.CL·May 21, 2025

TL;DR

This paper introduces SpecPrefill, a training-free method that reduces time-to-first-token (TTFT) in large language model inference by selecting only the important prompt tokens, which in turn raises the maximal QPS.

Contribution

It proposes a lightweight, training-free framework that reduces TTFT by identifying and preselecting the important prompt tokens for LLM inference, shifting the focus from optimizing compute-heavy attention to estimating token importance.

Findings

01. Achieves up to 7.66× TTFT improvement.

02. Serves Llama-3.1-405B-Instruct-FP8 with 7× higher QPS.

03. Demonstrates effectiveness across diverse tasks.

Abstract

Improving time-to-first-token (TTFT) is an essential objective in modern large language model (LLM) inference engines. Optimizing TTFT directly results in higher maximal QPS and meets the requirements of many critical applications. However, improving TTFT is notoriously challenging since it is compute-bound and the performance bottleneck shifts from the self-attention that many prior works focus on to the MLP part. In this work, we present SpecPrefill, a training-free framework that accelerates the inference TTFT for both long and medium context queries based on the following insight: LLMs generalize well enough to preserve quality given only a carefully chosen subset of prompt tokens. At its core, SpecPrefill leverages a lightweight model to speculate locally important tokens based on the context. These tokens, along with the necessary positional information, are then…
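
The selection step described in the abstract (keep a subset of prompt tokens together with their positional information) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_prompt_tokens`, the `keep_ratio` parameter, and the plain top-k rule are all hypothetical, and the sketch assumes a lightweight model has already produced one importance score per prompt token. How those scores are computed is not covered here.

```python
import math
from typing import List, Sequence, Tuple


def select_prompt_tokens(
    token_ids: Sequence[int],
    importance: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[int], List[int]]:
    """Keep the highest-scoring tokens and their ORIGINAL positions.

    Hypothetical helper: `importance` is assumed to come from a small
    scoring model; `keep_ratio` is the fraction of tokens to retain.
    """
    if len(token_ids) != len(importance):
        raise ValueError("token_ids and importance must have equal length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    n = len(token_ids)
    if n == 0:
        return [], []

    # Number of tokens to keep, never fewer than one.
    k = max(1, math.ceil(n * keep_ratio))

    # Rank positions by score (highest first), then restore prompt order
    # so the kept tokens appear in their original sequence.
    ranked = sorted(range(n), key=lambda i: importance[i], reverse=True)
    kept_positions = sorted(ranked[:k])

    kept_tokens = [token_ids[i] for i in kept_positions]
    # The positions are returned unchanged (not renumbered 0..k-1), so a
    # downstream model could still see where each token sat in the prompt.
    return kept_tokens, kept_positions


if __name__ == "__main__":
    tokens = [101, 7, 8, 9, 10, 102]
    scores = [0.9, 0.1, 0.8, 0.05, 0.2, 0.7]
    kept, positions = select_prompt_tokens(tokens, scores, keep_ratio=0.5)
    print(kept, positions)
```

Returning the original positions alongside the kept token ids reflects the abstract's point that positional information accompanies the selected tokens; with half of the tokens dropped, the larger model would prefill three tokens instead of six in this toy example.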

Code & Models

Repositories

Jingyu6/speculative_prefill
PyTorch · Official

Videos

Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation · SlidesLive

Taxonomy

Topics: Advancements in Semiconductor Devices and Circuit Design · Integrated Circuits and Semiconductor Failure Analysis · Thin-Film Transistor Technologies

Methods: Focus · Sparse Evolutionary Training