FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design

Jiahao Zhang; Zifan He; Nicholas Fraser; Michaela Blott; Yizhou Sun; Jason Cong

arXiv:2601.15710 · cs.AR · January 23, 2026

TL;DR

FlexLLM is a composable HLS library for rapidly building hybrid LLM accelerators with customizable architecture, quantization, and long-context processing. A Llama-3.2 1B system built with it on an AMD U280 FPGA reports a 1.29× end-to-end speedup over a GPU baseline.

Contribution

The paper presents FlexLLM, a novel HLS library that simplifies and accelerates the design of domain-specific LLM accelerators with customizable features and integrated long-context support.

Findings

01

Achieves a 1.29× end-to-end speedup over a GPU baseline on the AMD U280 FPGA.

02

Reaches 12.68 WikiText-2 perplexity with hardware-efficient quantization, surpassing the SpinQuant baseline.

03

Extends the context window and reduces latency through a Hierarchical Memory Transformer (HMT) plug-in.

Abstract

We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing the SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29× end-to-end speedup, 1.64× higher decode…
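
The 12.68 WikiText-2 PPL figure quoted in the abstract is a perplexity score, where lower is better. As a reminder of what that metric measures, here is a minimal sketch that assumes per-token natural-log probabilities are already available. The function name `perplexity` and the toy inputs are illustrative only; they are not taken from FlexLLM or from the paper's evaluation code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    ground-truth token (hypothetical input format, assumed for this sketch).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy check: a model that gives every correct token probability 1/4
# is as uncertain as a uniform choice among 4 options.
toy_log_probs = [math.log(0.25)] * 8
toy_ppl = perplexity(toy_log_probs)
```

In this toy case the result is 4 up to floating-point rounding, so a score of 12.68 corresponds roughly to the uncertainty of a uniform choice among about 13 tokens at each position.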

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Neural Network Applications