FlexLLM: Composable HLS Library for Flexible Hybrid LLM Accelerator Design
Jiahao Zhang, Zifan He, Nicholas Fraser, Michaela Blott, Yizhou Sun, Jason Cong

TL;DR
FlexLLM introduces a flexible, composable HLS library enabling rapid development of hybrid LLM accelerators with customizable architecture, quantization, and long-context processing, achieving significant speed and efficiency improvements on FPGA hardware.
Contribution
The paper presents FlexLLM, a novel HLS library that simplifies and accelerates the design of domain-specific LLM accelerators with customizable features and integrated long-context support.
Findings
On the AMD U280 FPGA (16nm), the accelerator achieves a 1.29× end-to-end speedup over a GPU baseline.
Hardware-efficient quantization reaches 12.68 WikiText-2 perplexity, surpassing the SpinQuant baseline.
A Hierarchical Memory Transformer (HMT) plug-in extends the context window and reduces latency for long-context processing.
Abstract
We present FlexLLM, a composable High-Level Synthesis (HLS) library for rapid development of domain-specific LLM accelerators. FlexLLM exposes key architectural degrees of freedom for stage-customized inference, enabling hybrid designs that tailor temporal reuse and spatial dataflow differently for prefill and decode, and provides a comprehensive quantization suite to support accurate low-bit deployment. Using FlexLLM, we build a complete inference system for the Llama-3.2 1B model in under two months with only 1K lines of code. The system includes: (1) a stage-customized accelerator with hardware-efficient quantization (12.68 WikiText-2 PPL) surpassing SpinQuant baseline, and (2) a Hierarchical Memory Transformer (HMT) plug-in for efficient long-context processing. On the AMD U280 FPGA at 16nm, the accelerator achieves 1.29× end-to-end speedup, 1.64× higher decode…
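For readers unfamiliar with the perplexity (PPL) figure quoted in the abstract, the sketch below shows the standard definition: the exponential of the mean negative log-likelihood per token, where lower is better. It is a generic illustration, not code from FlexLLM. The function name `perplexity` and the sample log-probabilities are hypothetical, and a real WikiText-2 evaluation would take its log-probabilities from the model's output over the test set.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over a token sequence.

    token_log_probs: natural-log probabilities that the model assigned
    to each ground-truth token (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative values only: three tokens predicted with
# probabilities 1/4, 1/2, and 1/8.
example_log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]
ppl = perplexity(example_log_probs)
```

Under this definition, a model that spreads probability uniformly over V candidate tokens has a perplexity of V, so a lower score means the model is, on average, choosing among fewer plausible next tokens.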
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Embedded Systems Design Techniques · Advanced Neural Network Applications
