SkipOPU: An FPGA-based Overlay Processor for Large Language Models with Dynamically Allocated Computation

Zicheng He; Anhao Zhao; Xiaoyu Shen; Chen Wu; Lei He

arXiv:2603.14785·cs.AR·March 17, 2026

Open Access

TL;DR

SkipOPU is an FPGA-based overlay processor that enables dynamic computation allocation for large language models, significantly improving inference efficiency and reducing storage overhead by exploiting token and layer variability.

Contribution

It introduces a flexible FPGA overlay architecture that supports dynamic inference patterns, including a novel reduction-decoupling scheme, hybrid-precision support, and an on-chip KV buffer for efficient LLM inference.

Findings

01. Outperforms GPU and FPGA accelerators by 1.23x-3.83x in bandwidth efficiency.

02. Reduces KV storage overhead by up to 25.4%.

03. Supports dynamic computation allocation with high flexibility.

Abstract

Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their inference efficiency remains a critical bottleneck due to rapidly growing parameter counts. Recent advances in dynamic computation allocation address this challenge by exploiting the highly uneven contributions of different tokens and layers, enabling selective execution that significantly reduces redundant computation while preserving model accuracy. However, existing hardware platforms and accelerators are primarily optimized for uniform, static execution, limiting their ability to efficiently support such dynamic inference patterns. In this work, we propose SkipOPU, an FPGA-based overlay processor that dynamically allocates computation across tokens and layers with high flexibility through a lightweight routing mechanism. First, we decouple reduction operations from element-wise…
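The general idea of dynamic computation allocation mentioned in the abstract can be illustrated with a minimal software sketch. This is not SkipOPU's routing mechanism or hardware design, which the excerpt above does not specify. It assumes a hypothetical per-layer linear router that scores each token's hidden vector and skips the layer for that token when the score falls at or below a threshold. The names `route`, `run_with_skipping`, `executed_fraction`, and the threshold value are illustrative assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def route(hidden: Vector, weights: Sequence[float], threshold: float) -> bool:
    """Hypothetical linear router: True means the token executes the layer."""
    score = sum(h * w for h, w in zip(hidden, weights))
    return score > threshold


def run_with_skipping(
    tokens: Sequence[Vector],
    layers: Sequence[Layer],
    router_weights: Sequence[Sequence[float]],
    threshold: float = 0.0,
) -> Tuple[List[Vector], List[List[bool]]]:
    """Apply each layer only to the tokens the router selects.

    A skipped token keeps its hidden vector unchanged for that layer.
    Returns the final hidden vectors and a token-by-layer execution mask.
    """
    outputs: List[Vector] = []
    executed: List[List[bool]] = []
    for hidden in tokens:
        h = list(hidden)
        mask: List[bool] = []
        for layer, weights in zip(layers, router_weights):
            run_layer = route(h, weights, threshold)
            if run_layer:
                h = layer(h)
            mask.append(run_layer)
        outputs.append(h)
        executed.append(mask)
    return outputs, executed


def executed_fraction(executed: Sequence[Sequence[bool]]) -> float:
    """Fraction of (token, layer) pairs that were actually computed."""
    total = sum(len(row) for row in executed)
    done = sum(1 for row in executed for flag in row if flag)
    return done / total if total else 0.0
```

In this sketch the execution mask differs from token to token, which is the kind of non-uniform, data-dependent pattern the abstract says static accelerators handle poorly. The value returned by `executed_fraction` gives a rough measure of how much computation the router saved under the assumed threshold.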

Taxonomy

Topics: Advanced Neural Network Applications · Natural Language Processing Techniques · Parallel Computing and Optimization Techniques