AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design

Yanbiao Liang, Huihong Shi, Haikuo Shao, and Zhongfeng Wang

arXiv:2505.03745·cs.AR·May 8, 2025

Open Access

TL;DR

AccLLM is an algorithm-hardware co-design framework that accelerates long-context LLM inference on edge devices by combining pruning, Λ-shaped attention, W2A8KV4 quantization, and FPGA acceleration.

Contribution

It introduces a novel integrated approach combining algorithmic optimizations and FPGA hardware design for efficient long-sequence LLM inference on resource-limited devices.

Findings

1. 4.07x energy efficiency improvement on FPGA
2. 2.98x throughput increase over state-of-the-art
3. Effective reduction in memory and bandwidth usage

Abstract

Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively…
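To make the W2A8KV4 label concrete, the following is a minimal sketch of what the three bit widths imply for value range and storage. Only the bit widths (2, 8, 4) come from the abstract. The quantizer itself (symmetric, per-tensor, round-to-nearest) and the names `BITS`, `quantize`, `dequantize`, and `compression_vs_fp16` are assumptions made for illustration and should not be read as the scheme the authors implement.

```python
# Bit widths named by the W2A8KV4 label in the abstract.
BITS = {"weight": 2, "activation": 8, "kv_cache": 4}


def quantize(values, bits):
    """Symmetric per-tensor round-to-nearest quantization (assumed scheme).

    Returns integer codes in [-(2**(bits-1)), 2**(bits-1) - 1] and the scale
    needed to map the codes back to floating point.
    """
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude onto qmax; fall back to 1.0 for all-zero input.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def compression_vs_fp16(bits):
    """Storage reduction relative to a 16-bit baseline (ignores scale overhead)."""
    return 16 / bits


if __name__ == "__main__":
    sample = [0.8, -0.31, 0.05, -0.77, 0.42]
    for name, bits in BITS.items():
        codes, scale = quantize(sample, bits)
        print(name, bits, codes, f"{compression_vs_fp16(bits):.0f}x smaller than FP16")
```

Under these assumptions, 2-bit weights leave only four representable levels per tensor, which suggests why such aggressive weight quantization is usually paired with other techniques, while the 4-bit KV cache targets the memory that grows with sequence length during autoregressive generation.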

Taxonomy

Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling