SpeedLLM: An FPGA Co-design of Large Language Model Inference Accelerator
Peipei Wang, Wu Guan, Liping Liang, Zhijun Wang, Hanqing Luo, Zhibin Zhang

TL;DR
SpeedLLM is an FPGA-based accelerator optimized for Tinyllama. It improves inference speed and energy efficiency for edge computing through data stream parallelism, memory reuse, and operator fusion.
Contribution
The paper presents an FPGA co-design for LLM inference that combines data stream parallelism, memory reuse, and operator fusion to improve performance and energy efficiency.
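To make the operator fusion idea concrete, here is a minimal software sketch. This summary does not say which Llama2 operators SpeedLLM fuses, so the RMSNorm plus matrix-vector pair below is an assumption chosen for illustration, and every function name is hypothetical. The point is only that the fused version never materializes the normalized vector as a separate buffer.

```python
import math


def rmsnorm(x, gain, eps=1e-5):
    """RMSNorm as its own operator: returns a full intermediate vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * scale for g, v in zip(gain, x)]


def matvec(weights, x):
    """Plain matrix-vector product, one dot product per output row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def unfused(x, gain, weights):
    """Two separate operators: the normalized vector is stored, then re-read."""
    normed = rmsnorm(x, gain)
    return matvec(weights, normed)


def fused_rmsnorm_matvec(x, gain, weights, eps=1e-5):
    """Hypothetical fused operator: normalization is applied on the fly
    inside each dot product, so no intermediate vector is written out."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [
        sum(w * (g * v * scale) for w, g, v in zip(row, gain, x))
        for row in weights
    ]


x = [1.0, -2.0, 3.0, 0.5]
gain = [1.0, 0.5, 2.0, 1.5]
weights = [[0.1, 0.2, 0.3, 0.4], [-1.0, 0.0, 1.0, 2.0]]

reference = unfused(x, gain, weights)
fused = fused_rmsnorm_matvec(x, gain, weights)
```

In hardware, the benefit would come from skipping the write and re-read of the intermediate buffer. The Python version only demonstrates that the two formulations compute the same values.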
Findings
Up to 4.8× faster inference than traditional Tinyllama implementations
1.18× lower energy consumption
Optimized for edge computing applications
Abstract
This paper introduces SpeedLLM, a neural network accelerator designed on the Xilinx Alveo U280 platform and optimized for the Tinyllama framework to enhance edge computing performance. Key innovations include data stream parallelism, a memory reuse strategy, and Llama2 operator fusion, which collectively reduce latency and energy consumption. SpeedLLM's data pipeline architecture optimizes the read-compute-write cycle, while the memory reuse strategy minimizes FPGA resource demands. The operator fusion boosts computational density and throughput. Results show that SpeedLLM outperforms traditional Tinyllama implementations, achieving up to 4.8× faster performance and 1.18× lower energy consumption, which makes it better suited to inference on edge devices.
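A simple timing model shows why overlapping the read-compute-write cycle reduces latency. This is a sketch of an idealized three-stage pipeline with fixed per-tile stage times and enough buffering that stages never stall. It is not SpeedLLM's actual design, and the cycle counts are made-up illustrative numbers, not measurements from the paper.

```python
def sequential_cycles(n_tiles, read, compute, write):
    """Each tile is read, computed, and written before the next one starts."""
    return n_tiles * (read + compute + write)


def pipelined_cycles(n_tiles, read, compute, write):
    """Ideal 3-stage pipeline: the first tile pays the full latency, and each
    later tile finishes one bottleneck-stage interval after the previous one."""
    if n_tiles == 0:
        return 0
    return (read + compute + write) + (n_tiles - 1) * max(read, compute, write)


# Hypothetical per-tile stage times in cycles (assumed, not from the paper).
n_tiles, read, compute, write = 8, 4, 6, 2

seq = sequential_cycles(n_tiles, read, compute, write)   # 8 * 12 = 96
pipe = pipelined_cycles(n_tiles, read, compute, write)   # 12 + 7 * 6 = 54
speedup = seq / pipe
```

Under this model the slowest stage sets steady-state throughput, so the gain from pipelining is largest when read, compute, and write times are balanced.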
