PD-Swap: Prefill-Decode Logic Swapping for End-to-End LLM Inference on Edge FPGAs via Dynamic Partial Reconfiguration

Yifan Zhang, Zhiheng Chen, Ye Qiao, and Sitao Huang

arXiv:2512.11550 · cs.AR · December 15, 2025

Open Access

TL;DR

PD-Swap is a reconfigurable FPGA accelerator that uses Dynamic Partial Reconfiguration to swap between prefill and decode logic, so that large language models with long contexts run efficiently on edge devices. It reaches up to 27 tokens/sec decoding throughput, 1.3x to 2.1x better than the prior state of the art.

Contribution

It introduces a disaggregated accelerator architecture using Dynamic Partial Reconfiguration to optimize resource utilization for different LLM inference phases on edge FPGAs.

Findings

01. Achieves up to 27 tokens/sec decoding throughput.

02. Outperforms prior state-of-the-art by 1.3x to 2.1x.

03. Effectively handles longer context lengths with improved efficiency.

Abstract

Aggressively quantized large language models (LLMs), such as BitNet-style 1.58-bit Transformers with ternary weights, make it feasible to deploy generative AI on low-power edge FPGAs. However, as prompts grow to tens of thousands of tokens, edge hardware performance drops sharply with sequence length due to quadratic prefill cost and rapidly increasing KV-cache bandwidth demands, making inference latency at longer context lengths a first-order system concern. Recent studies on LLMs expose a fundamental prefill-decode asymmetry: prefill is compute-bound and dominated by dense matrix-matrix operations, whereas decoding is memory-bandwidth-bound and dominated by KV-cache traffic. A static accelerator must provision resources and a single dataflow for both regimes, leading to duplicated attention logic, underutilized fabric, and tight LUT/URAM limits that cap model size and usable context.…
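
The prefill-decode asymmetry described in the abstract can be illustrated with a back-of-the-envelope estimate. The sketch below is not taken from the paper: it is a simplified cost model for a single attention layer that counts only the QK^T and attention-times-V products and the KV-cache reads, ignores causal masking, projections, and on-chip reuse limits, and uses hypothetical names (`attention_cost`, `arithmetic_intensity`) and a hypothetical `d_model`. Under those assumptions, prefill work grows quadratically with prompt length while decode performs only about two operations per KV-cache byte.

```python
def attention_cost(context_len, new_tokens, d_model, bytes_per_elem=1):
    """Rough cost of one attention layer (illustrative model, not the paper's).

    context_len    : number of cached key/value positions attended to
    new_tokens     : query tokens processed in this step
                     (prefill: new_tokens == context_len, decode: new_tokens == 1)
    d_model        : hidden width (hypothetical value chosen by the caller)
    bytes_per_elem : assumed storage size of one KV-cache element
    """
    # QK^T and (scores x V) each cost about 2 * new_tokens * context_len * d_model FLOPs.
    flops = 4 * new_tokens * context_len * d_model
    # Keys and values are each read once: 2 * context_len * d_model elements.
    kv_bytes = 2 * context_len * d_model * bytes_per_elem
    return flops, kv_bytes


def arithmetic_intensity(context_len, new_tokens, d_model, bytes_per_elem=1):
    """FLOPs performed per byte of KV-cache traffic under the model above."""
    flops, kv_bytes = attention_cost(context_len, new_tokens, d_model, bytes_per_elem)
    return flops / kv_bytes


if __name__ == "__main__":
    d_model = 2048  # hypothetical width, for illustration only
    for context_len in (1024, 4096, 16384):
        prefill = arithmetic_intensity(context_len, context_len, d_model)
        decode = arithmetic_intensity(context_len, 1, d_model)
        print(f"context={context_len}: prefill {prefill:.0f} FLOPs/byte, "
              f"decode {decode:.0f} FLOPs/byte")
```

In this simplified model the prefill intensity scales with the prompt length, which is why prefill tends to be limited by compute, while the decode intensity stays constant and low, which is why decoding tends to be limited by memory bandwidth as the KV cache grows.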

Taxonomy

Topics: Embedded Systems Design Techniques · Parallel Computing and Optimization Techniques · Big Data and Digital Economy