FAST-Prefill: FPGA Accelerated Sparse Attention for Long Context LLM Prefill
Rakshith Jayanth, Viktor Prasanna

TL;DR
FAST-Prefill is an FPGA-based accelerator for the prefill stage of long-context LLM inference with sparse attention. It reduces latency and improves energy efficiency relative to GPU execution by addressing two bottlenecks: memory-bound computation and dynamic, input-dependent sparsity patterns.
Contribution
The authors present FAST-Prefill as the first FPGA accelerator designed specifically for dynamic sparse attention in the prefill stage of long-context LLM inference, built on memory-aware and hybrid processing architectures.
Findings
Up to 2.5× speedup in time to first token (TTFT) over GPU
Up to 4.5× energy efficiency improvement
Effective handling of dynamic sparse attention patterns
Abstract
In long-context large language model (LLM) inference, the prefill stage dominates computation due to self-attention over the complete input context. Sparse attention significantly reduces self-attention computation by limiting each token's interactions to a subset of tokens. The attention sparsity pattern varies across input prompts, and within a prompt, each attention head can follow a distinct pattern. This makes attention sparsity dynamic. The requirement of generating the sparsity pattern, combined with limited data reuse in attention, shifts the prefill compute to being memory-bound. This, in addition to the huge energy requirements for long-context inference on GPU, motivates FPGAs as good candidates for accelerating dynamic long-context inference. To tackle these challenges, we propose FAST-Prefill, the first FPGA accelerator for long-context prefill-stage inference with…
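The idea of sparse attention described in the abstract can be illustrated with a minimal, single-head sketch in plain Python. This is not the FAST-Prefill design or any method from the paper: the function name `sparse_causal_attention`, the `keep` parameter, and the top-k selection rule are all hypothetical choices made for illustration. The sketch scores every causal key in order to pick the kept set, which is exactly the pattern-generation overhead the abstract refers to; practical methods estimate the pattern more cheaply.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_causal_attention(q, k, v, keep):
    """Single-head attention in which each query attends only to its
    `keep` highest-scoring causal keys (hypothetical top-k rule).

    q, k, v are lists of equal-length float vectors, one per token.
    Returns (outputs, pattern), where pattern[i] lists the key indices
    that token i actually attended to.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs, pattern = [], []
    for i, qi in enumerate(q):
        # Causal mask: token i may only see tokens 0..i.
        scores = [dot(qi, k[j]) * scale for j in range(i + 1)]
        # Dynamic sparsity: the kept set depends on the input values,
        # so it differs across prompts (and, in a real model, across heads).
        ranked = sorted(range(i + 1), key=lambda j: scores[j], reverse=True)
        kept = sorted(ranked[:keep])
        weights = softmax([scores[j] for j in kept])
        out = [
            sum(w * v[j][d] for w, j in zip(weights, kept))
            for d in range(len(v[0]))
        ]
        outputs.append(out)
        pattern.append(kept)
    return outputs, pattern


# Toy input: three tokens with two-dimensional embeddings.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, pattern = sparse_causal_attention(q, k, v, keep=2)
print(pattern)
```

With `keep` set to the sequence length or larger, the function reduces to ordinary dense causal attention, which makes it easy to compare the sparse and dense outputs on small inputs.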
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
