FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill

Rakshith Jayanth; Viktor Prasanna

arXiv:2602.20515·cs.AR·February 25, 2026

Open Access

TL;DR

FAST-Prefill is an FPGA-based accelerator for long-context sparse attention inference in large language models. It reduces time to first token (TTFT) and improves energy efficiency relative to a GPU by addressing the memory-bound behavior and dynamic sparsity of the prefill stage.

Contribution

The paper presents what the authors describe as the first FPGA accelerator designed specifically for dynamic sparse attention in the prefill stage of long-context LLM inference, built on memory-aware and hybrid processing architectures.

Findings

01 Up to 2.5× speedup in TTFT over GPU

02 Up to 4.5× energy efficiency improvement

03 Effective handling of dynamic sparse attention patterns

Abstract

In long-context large language model (LLM) inference, the prefill stage dominates computation because self-attention runs over the complete input context. Sparse attention significantly reduces this computation by limiting each token's interactions to a subset of tokens. The sparsity pattern varies across input prompts, and within a prompt each attention head can follow a distinct pattern, which makes attention sparsity dynamic. The need to generate the sparsity pattern, combined with limited data reuse in attention, makes the prefill computation memory-bound. This, together with the high energy cost of long-context inference on GPUs, motivates FPGAs as candidates for accelerating dynamic long-context inference. To tackle these challenges, we propose FAST-Prefill, the first FPGA accelerator for long-context prefill-stage inference with…
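To make the idea of dynamic sparse attention concrete, here is a minimal Python sketch for a single attention head. It is illustrative only and does not reproduce FAST-Prefill's method: the pattern generator (`select_blocks`, which ranks key blocks by the dot product of mean-pooled queries and keys) and the names `block` and `keep` are assumptions chosen for this example. The sketch shows the two steps the abstract describes: first generate a per-input sparsity pattern, then compute attention only over the selected subset of tokens.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(rows):
    # Average a list of equal-length vectors into one vector.
    return [sum(col) / len(rows) for col in zip(*rows)]

def select_blocks(q, k, block, keep):
    """Hypothetical pattern generator (an assumption, not the paper's method).

    For each query block, keep the diagonal key block plus the (keep - 1)
    earlier key blocks with the highest pooled query-key score.
    """
    n_blocks = math.ceil(len(q) / block)
    q_pooled = [mean_pool(q[i * block:(i + 1) * block]) for i in range(n_blocks)]
    k_pooled = [mean_pool(k[i * block:(i + 1) * block]) for i in range(n_blocks)]
    pattern = []
    for qi in range(n_blocks):
        earlier = list(range(qi))  # causal: only blocks before the query block
        earlier.sort(key=lambda kj: dot(q_pooled[qi], k_pooled[kj]), reverse=True)
        chosen = set(earlier[:max(keep - 1, 0)])
        chosen.add(qi)  # the diagonal block is always kept
        pattern.append(sorted(chosen))
    return pattern

def sparse_attention(q, k, v, block, keep):
    """Causal softmax attention restricted to the selected key blocks."""
    pattern = select_blocks(q, k, block, keep)
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        # Token indices this query may attend to: selected blocks, causal only.
        cols = [j
                for kb in pattern[i // block]
                for j in range(kb * block, min((kb + 1) * block, len(k)))
                if j <= i]
        scores = [dot(qi, k[j]) * scale for j in cols]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][d] for w, j in zip(weights, cols)) / total
                    for d in range(len(v[0]))])
    return out, pattern
```

Because the pattern depends on the values of `q` and `k`, it changes from prompt to prompt and from head to head, which is the sense in which the sparsity is dynamic. Setting `keep` to the number of blocks recovers full causal attention, so the parameter trades accuracy against the amount of key and value data that must be fetched.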

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy