AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Yanbiao Liang, Huihong Shi, Haikuo Shao, and Zhongfeng Wang

TL;DR
AccLLM is a co-designed algorithm-hardware framework that significantly accelerates long-context LLM inference on edge devices by combining pruning, specialized attention, quantization, and FPGA acceleration.
Contribution
It introduces a novel integrated approach combining algorithmic optimizations and FPGA hardware design for efficient long-sequence LLM inference on resource-limited devices.
Findings
4.07x energy efficiency improvement on FPGA
2.98x throughput increase over state-of-the-art
Effective reduction in memory and bandwidth usage
Abstract
Recently, large language models (LLMs) have achieved huge success in the natural language processing (NLP) field, driving a growing demand to extend their deployment from the cloud to edge devices. However, deploying LLMs on resource-constrained edge devices poses significant challenges, including (1) intensive computations and huge model sizes, (2) great memory and bandwidth demands introduced by the autoregressive generation process, and (3) limited scalability for handling long sequences. To address these challenges, we propose AccLLM, a comprehensive acceleration framework that enables efficient and fast long-context LLM inference through algorithm and hardware co-design. At the algorithmic level, we integrate (1) pruning, (2) Λ-shaped attention, and (3) an innovative W2A8KV4 (2-bit weights, 8-bit activations, and 4-bit KV cache) quantization scheme, thus effectively…
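The Λ-shaped attention named in the abstract is commonly described as a causal pattern in which each query attends to a few initial tokens plus a window of the most recent tokens. The sketch below illustrates that general pattern only; it is not taken from the paper. The function names and the parameters `n_sink` and `window` are hypothetical, and the values AccLLM actually uses are not given in this excerpt.

```python
def lambda_mask(seq_len, n_sink, window):
    """Build a seq_len x seq_len boolean mask for a Lambda-shaped pattern.

    mask[q][k] is True when query position q may attend to key position k.
    n_sink and window are illustrative parameters, not values from the paper.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q            # never attend to future tokens
            is_sink = k < n_sink       # the first n_sink tokens stay visible
            is_local = q - k < window  # the most recent `window` tokens
            row.append(causal and (is_sink or is_local))
        mask.append(row)
    return mask


def kv_entries_kept(seq_len, n_sink, window):
    """Number of KV-cache entries the newest query needs under this pattern."""
    return min(seq_len, n_sink + window)


if __name__ == "__main__":
    # Print the mask for a short sequence; '#' marks an allowed position.
    for row in lambda_mask(seq_len=8, n_sink=2, window=3):
        print("".join("#" if allowed else "." for allowed in row))
```

Under this pattern the number of keys each query needs is bounded by `n_sink + window` regardless of sequence length, which is why it is attractive for long-context inference on memory-limited devices.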
Taxonomy
Topics: Natural Language Processing Techniques · Big Data and Digital Economy · Topic Modeling
