PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration
Yifan Zhang, Zhiheng Chen, Ye Qiao, and Sitao Huang

TL;DR
PD-Swap is a reconfigurable FPGA accelerator that uses dynamic partial reconfiguration to swap between prefill and decode logic, so that large language models with long contexts run efficiently on edge devices, reaching up to 27 tokens/sec in decoding.
Contribution
The paper introduces a disaggregated accelerator architecture that uses Dynamic Partial Reconfiguration to optimize resource utilization for the prefill and decode phases of LLM inference on edge FPGAs.
Findings
Achieves up to 27 tokens/sec decoding throughput.
Outperforms prior state-of-the-art by 1.3x to 2.1x.
Effectively handles longer context lengths with improved efficiency.
Abstract
Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making inference latency at long context lengths a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context.…
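The prefill-decode asymmetry described in the abstract can be illustrated with a back-of-the-envelope estimate of arithmetic intensity (FLOPs per byte moved). The sketch below is not taken from the paper. It assumes a single attention head with head dimension `d`, counts only the QK^T and attention-times-V products, ignores causal masking and the linear projections, and assumes the score matrix stays on-chip. The function names and the `kv_bytes` parameter are hypothetical.

```python
def attention_prefill_cost(n, d, kv_bytes=1):
    """Rough cost of attention over a full prompt of n tokens (one head).

    Assumption: QK^T and (scores @ V) each cost 2*n*n*d FLOPs, and only
    Q, K, V, and the output (n*d elements each) cross the memory interface.
    """
    flops = 4 * n * n * d
    bytes_moved = 4 * n * d * kv_bytes
    return flops, bytes_moved


def attention_decode_cost(n, d, kv_bytes=1):
    """Rough cost of attention for ONE new token with n cached tokens.

    Assumption: the whole K and V cache (2*n*d elements) is re-read for
    every generated token, plus the query and output vectors (2*d elements).
    """
    flops = 4 * n * d
    bytes_moved = (2 * n * d + 2 * d) * kv_bytes
    return flops, bytes_moved


def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved; low values mean bandwidth-bound."""
    return flops / bytes_moved


if __name__ == "__main__":
    d = 64  # illustrative head dimension, not a value from the paper
    for n in (1024, 4096, 16384):
        prefill = arithmetic_intensity(*attention_prefill_cost(n, d))
        decode = arithmetic_intensity(*attention_decode_cost(n, d))
        print(f"n={n:6d}  prefill={prefill:9.1f} FLOPs/B  decode={decode:.3f} FLOPs/B")
```

Under these simplifying assumptions, prefill intensity grows linearly with the prompt length, while decode intensity stays below 2 FLOPs per byte regardless of context length. Prefill FLOPs also quadruple when the prompt doubles, and decode traffic grows in proportion to the cached context, which is consistent with the compute-bound versus bandwidth-bound split the abstract describes.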
Taxonomy
Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy
