SpeedLLM: An FPGA Co-design of Large Language Model Inference Accelerator

Peipei Wang; Wu Guan; Liping Liang; Zhijun Wang; Hanqing Luo; Zhibin Zhang

arXiv:2507.14139 · cs.AR · July 22, 2025

TL;DR

SpeedLLM is an FPGA-based accelerator optimized for TinyLlama. It targets edge computing and reports up to 4.8x faster inference and 1.18x lower energy consumption than traditional TinyLlama implementations, using data stream parallelism, memory reuse, and operator fusion.

Contribution

The paper presents an FPGA co-design for LLM inference that combines data stream parallelism, a memory reuse strategy, and Llama2 operator fusion to improve performance and energy efficiency.
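To make the operator fusion idea concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the fused operators are the SiLU activation and the elementwise gating multiply from the Llama2 feed-forward block; this summary does not say which operators SpeedLLM actually fuses, and the function names are hypothetical. The unfused version writes an intermediate buffer between the two operators, while the fused version computes each output element in a single pass.

```python
import math


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_unfused(gate, up):
    """Two separate operators with an intermediate buffer in between."""
    activated = [silu(g) for g in gate]  # intermediate result is stored
    return [a * u for a, u in zip(activated, up)]


def swiglu_fused(gate, up):
    """One pass per element: activation and multiply with no intermediate buffer."""
    return [silu(g) * u for g, u in zip(gate, up)]


# Hypothetical input values, chosen only to exercise both versions.
gate = [-1.0, 0.0, 0.5, 2.0]
up = [0.25, 1.0, -3.0, 4.0]
assert swiglu_fused(gate, up) == swiglu_unfused(gate, up)
```

Both versions produce the same values. What the fused form saves is the intermediate buffer and the extra read and write of it, which is the kind of saving fusion is generally meant to deliver on hardware with limited on-chip memory.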

Findings

01  Up to 4.8x faster inference than traditional TinyLlama implementations

02  1.18x lower energy consumption

03  Optimized for edge computing applications

Abstract

This paper introduces SpeedLLM, a neural network accelerator designed on the Xilinx Alveo U280 platform and optimized for the TinyLlama framework to enhance edge computing performance. Key innovations include data stream parallelism, a memory reuse strategy, and Llama2 operator fusion, which collectively reduce latency and energy consumption. SpeedLLM's data pipeline architecture optimizes the read-compute-write cycle, while the memory reuse strategy minimizes FPGA resource demands. The operator fusion boosts computational density and throughput. Results show that SpeedLLM outperforms traditional TinyLlama implementations, achieving up to 4.8x faster performance and 1.18x lower energy consumption, gains that matter most on resource-constrained edge devices.
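The benefit of overlapping the read-compute-write cycle can be illustrated with a simple timing model. The sketch below is not taken from the paper: the function names and cycle counts are hypothetical, and it assumes identical tiles, three stages that can run concurrently on different tiles, and enough buffering between stages that none of them stalls.

```python
def sequential_cycles(n_tiles: int, read: int, compute: int, write: int) -> int:
    """Each tile is fully read, computed, and written before the next one starts."""
    return n_tiles * (read + compute + write)


def pipelined_cycles(n_tiles: int, read: int, compute: int, write: int) -> int:
    """Stages overlap across tiles; the slowest stage sets the steady-state rate."""
    if n_tiles == 0:
        return 0
    fill = read + compute + write  # latency of the first tile
    return fill + (n_tiles - 1) * max(read, compute, write)


# Hypothetical cycle counts per tile, not measurements from the paper.
print(sequential_cycles(8, read=2, compute=4, write=2))  # 64
print(pipelined_cycles(8, read=2, compute=4, write=2))   # 36
```

In this model the pipelined schedule approaches one tile per `max(read, compute, write)` cycles as the tile count grows, so the gain is largest when the three stage times are balanced.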
