EdgeLLM: A Highly Efficient CPU-FPGA Heterogeneous Edge Accelerator for Large Language Models
Mingqiang Huang, Ao Shen, Kai Li, Haoxiang Peng, Boyu Li, Yupeng Su, Hao Yu

TL;DR
EdgeLLM introduces a CPU-FPGA heterogeneous accelerator that significantly improves the efficiency and throughput of deploying large language models on resource-constrained edge devices, addressing computation, memory, and deployment challenges.
Contribution
The paper presents a novel heterogeneous accelerator architecture, a universal data parallelism scheme, and an end-to-end compiler for efficient LLM deployment on edge devices.
Findings
Achieves 1.91x higher throughput than an NVIDIA A100 GPU.
Attains 7.55x higher energy efficiency than the GPU.
Outperforms state-of-the-art FPGA accelerators by 10-24%.
Abstract
Rapid advances in artificial intelligence (AI), particularly in Large Language Models (LLMs), have profoundly changed how we work and communicate. However, deploying LLMs on resource-constrained edge devices (such as robots) remains challenging because of intensive computation requirements, heavy memory access, diverse operator types, and difficulties in compilation. In this work, we propose EdgeLLM to address these issues. First, focusing on computation, we design a mixed-precision processing element array together with a group systolic architecture that efficiently supports both FP16*FP16 for the Multi-Head Attention (MHA) block and FP16*INT4 for the Feed-Forward Network (FFN) layer. In addition, a specific optimization for log-scale structured weight sparsity is used to further increase efficiency. Secondly, to address the compilation and…
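To make the FP16*INT4 arithmetic mentioned for the FFN layer concrete, here is a minimal software sketch of a dot product between FP16 activations and INT4-quantized weights. It illustrates the number formats only and is not the paper's hardware design: the symmetric per-group quantization scheme, the group size of 4, and the helper names `to_fp16`, `quantize_int4`, and `dot_fp16_int4` are all assumptions introduced for this example.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_int4(weights, group_size=4):
    """Symmetric per-group INT4 quantization (an assumed scheme, not the paper's).

    Returns (codes, scales): integer codes in [-8, 7] and one FP16 scale per group.
    """
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = to_fp16(max_abs / 7.0)
        if scale == 0.0:  # all-zero (or underflowing) group: any scale works
            scale = 1.0
        scales.append(scale)
        # Round to the nearest code and clamp to the signed 4-bit range.
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales


def dot_fp16_int4(activations, codes, scales, group_size=4):
    """Dot product of FP16 activations with INT4 weight codes.

    Each group is accumulated on the integer codes first and scaled once.
    """
    total = 0.0
    for g, scale in enumerate(scales):
        start = g * group_size
        acts = activations[start:start + group_size]
        qs = codes[start:start + group_size]
        partial = sum(to_fp16(a) * q for a, q in zip(acts, qs))
        total += partial * scale
    return total


# Hypothetical example values, chosen only for illustration.
weights = [0.7, -0.35, 0.1, 0.0, 1.4, -1.4, 0.2, 0.6]
activations = [1.0, 2.0, -1.0, 0.5, 0.25, 0.5, 1.0, -2.0]

codes, scales = quantize_int4(weights)
approx = dot_fp16_int4(activations, codes, scales)
exact = sum(a * w for a, w in zip(activations, weights))
print(codes, scales, approx, exact)
```

Applying the scale once per group keeps the inner accumulation on small integer codes, which is the usual motivation for weight-only INT4 schemes. Whether EdgeLLM's processing elements organize the computation this way is not stated in the excerpt above.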
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Neural Networks and Applications · Topic Modeling
