Fast Forward: Accelerating LLM Prefill with Predictive FFN Sparsity

Aayush Gautam; Mukul Gagrani; Junyoung Park; Mingu Lee; Chris Lott; Narasimha Reddy

arXiv:2602.00397·cs.LG·February 3, 2026

TL;DR

FastForward is a predictive sparsity framework that accelerates LLM prefill by sparsifying FFNs block by block, with small accuracy loss. The gains matter most for long-context workloads, where prefill is a key bottleneck.

Contribution

The paper proposes a context-aware FFN sparsification method that combines neuron prediction, error compensation, and layer-wise sparsity scheduling to speed up prefill while limiting accuracy loss.
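To make the idea of block-wise, predicted FFN sparsity concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the gated (SiLU) FFN layout, the function names, the mean-pooled linear `predictor`, and the optional `compensator` hook are all hypothetical, and the predictor here is random rather than trained. In the paper's framing, `keep_ratio` would be chosen per layer by the sparsity scheduler; the sketch simply takes it as an argument.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def dense_ffn(x, w_gate, w_up, w_down):
    # Reference gated FFN over all d_ff neurons (assumed layout).
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def blockwise_sparse_ffn(x, w_gate, w_up, w_down, predictor,
                         keep_ratio, block_size, compensator=None):
    """Run the FFN on a per-block subset of neurons.

    x:           (n_tokens, d_model) prefill activations
    predictor:   hypothetical callable, block (b, d_model) -> scores (d_ff,)
    compensator: hypothetical callable, block (b, d_model) -> (b, d_model)
    """
    n_tokens, d_ff = x.shape[0], w_up.shape[1]
    k = max(1, int(round(keep_ratio * d_ff)))
    out = np.empty((n_tokens, w_down.shape[1]), dtype=x.dtype)
    for start in range(0, n_tokens, block_size):
        xb = x[start:start + block_size]
        scores = predictor(xb)
        # Indices of the k highest-scoring neurons, shared by the whole block.
        idx = np.argpartition(scores, -k)[-k:]
        hidden = silu(xb @ w_gate[:, idx]) * (xb @ w_up[:, idx])
        yb = hidden @ w_down[idx, :]
        if compensator is not None:
            # Optional cheap correction for the dropped neurons.
            yb = yb + compensator(xb)
        out[start:start + block_size] = yb
    return out

# Toy usage with random weights and an untrained, hypothetical predictor.
rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 16, 64, 12
x = rng.standard_normal((n_tokens, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
w_pred = rng.standard_normal((d_model, d_ff))

def toy_predictor(block):
    # Mean-pool the block, then score every neuron with one linear map.
    return np.abs(block.mean(axis=0) @ w_pred)

sparse = blockwise_sparse_ffn(x, w_gate, w_up, w_down, toy_predictor,
                              keep_ratio=0.5, block_size=4)
dense = dense_ffn(x, w_gate, w_up, w_down)
print("relative error:", np.linalg.norm(sparse - dense) / np.linalg.norm(dense))
```

Sharing one neuron subset across a block of tokens keeps the computation as dense matrix multiplies on sliced weights, which is why block-wise selection suits the parallel prefill stage better than per-token selection.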

Findings

01. Achieves up to 1.45× speedup at 50% FFN sparsity

02. Maintains less than 6% accuracy loss on LongBench

03. Reduces Time-to-First-Token for long-context inference

Abstract

The prefill stage of large language model (LLM) inference is a key computational bottleneck for long-context workloads. At short-to-moderate context lengths (1K–16K tokens), Feed-Forward Networks (FFNs) dominate this cost, accounting for most of the total FLOPs. Existing FFN sparsification methods, designed for autoregressive decoding, fail to exploit the prefill stage's parallelism and often degrade accuracy. To address this, we introduce FastForward, a predictive sparsity framework that accelerates LLM prefill through block-wise, context-aware FFN sparsity. FastForward combines (1) a lightweight expert predictor to select high-importance neurons per block, (2) an error compensation network to correct sparsity-induced errors, and (3) a layer-wise sparsity scheduler to allocate compute based on token-mixing importance. Across LLaMA and Qwen models up to 8B parameters, FastForward…
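The claim that FFNs account for most prefill FLOPs at 1K–16K tokens can be sanity-checked with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement from the paper: the layer dimensions are assumed values typical of an 8B-class model with a gated FFN and grouped-query attention, the function name is hypothetical, and normalization, embedding, and softmax costs are ignored.

```python
def ffn_flop_share(seq_len, d_model=4096, d_ff=14336, n_heads=32, n_kv_heads=8):
    """Approximate fraction of one layer's prefill FLOPs spent in the FFN.

    All default dimensions are illustrative assumptions.
    """
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # Gated FFN: gate, up, and down projections, 2 FLOPs per multiply-add.
    ffn = 2 * 3 * d_model * d_ff * seq_len
    # Q and O projections are d_model x d_model; K and V are d_model x kv_dim.
    proj = 2 * (2 * d_model * d_model + 2 * d_model * kv_dim) * seq_len
    # Causal attention: QK^T and AV each touch about seq_len^2 / 2 pairs.
    attn = 2 * 2 * d_model * seq_len * seq_len / 2
    return ffn / (ffn + proj + attn)

for n in (1024, 4096, 16384, 131072):
    print(f"{n:>7} tokens: FFN share = {ffn_flop_share(n):.2f}")
```

Under these assumptions the FFN share stays above one half through 16K tokens and falls below it at 128K, because the attention term grows quadratically with sequence length while the FFN term grows linearly.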

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Speech Recognition and Synthesis