HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis
Andy He, Darren Key, Mason Bulling, Andrew Chang, Skyler Shapiro, Everett Lee

TL;DR
This paper presents HLSTransform, an FPGA-based accelerator for Llama 2 transformers built with high level synthesis. It cuts energy consumption by up to 12.75x relative to a CPU and 8.25x relative to a GPU, speeds up inference by up to 2.46x over a CPU, and makes FPGA deployment more accessible.
Contribution
We developed an open-source FPGA accelerator for Llama 2 transformers using high level synthesis, achieving substantial energy savings and competitive inference speeds.
Findings
Up to 12.75x energy reduction compared to CPU
Up to 8.25x energy reduction compared to GPU
Inference speed increased by up to 2.46x over CPU
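The ratios above can be read with a small worked calculation. The sketch below assumes energy is compared per generated token, which this summary does not state explicitly, and the function names are hypothetical. It uses only the identity energy = average power × time.

```cpp
// Energy in joules per generated token:
// average power (watts) times wall-clock time (seconds), divided by tokens produced.
// Assumption: per-token energy is the quantity being compared.
double energy_per_token(double avg_power_watts, double seconds, double tokens) {
    return avg_power_watts * seconds / tokens;
}

// Reduction factor: how many times less energy per token the accelerator
// uses than the baseline device (an "Nx energy reduction").
double energy_reduction(double baseline_j_per_token, double accel_j_per_token) {
    return baseline_j_per_token / accel_j_per_token;
}
```

Energy reduction and speedup are separate quantities: a device that draws much less power can come out ahead on energy per token even when its throughput is similar to, or lower than, the baseline's.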
Abstract
Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers. Transformers have achieved state-of-the-art performance in many areas of machine learning and underpin most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, incurs high operational costs, and makes GPUs unsuitable for edge computing. We develop an accelerator for transformers, specifically Llama 2, an open-source state-of-the-art LLM, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x…
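To illustrate what describing hardware in HLS rather than RTL can look like, here is a minimal sketch of a matrix-vector product, the core operation in transformer inference, written in HLS-style C++. It is not the authors' code: the function name `matvec` and the dimensions are hypothetical, and the `#pragma HLS PIPELINE` directive follows AMD Vitis HLS conventions, which is an assumption about the toolchain. An ordinary C++ compiler ignores the pragma, so the same source can be checked in software before synthesis.

```cpp
#include <cstddef>

// Hypothetical dimensions chosen for illustration only; not taken from the paper.
constexpr std::size_t ROWS = 4;
constexpr std::size_t COLS = 8;

// Matrix-vector product out = W * x, written in the restricted C++ style
// that HLS tools typically accept: fixed-size arrays, static loop bounds,
// and no dynamic memory allocation.
void matvec(const float W[ROWS][COLS], const float x[COLS], float out[ROWS]) {
    for (std::size_t i = 0; i < ROWS; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < COLS; ++j) {
            // Directive asking the HLS tool to try to start one loop iteration
            // per clock cycle; ignored by standard C++ compilers.
#pragma HLS PIPELINE II=1
            acc += W[i][j] * x[j];
        }
        out[i] = acc;
    }
}
```

The design choice this shows is that the loop structure and the directives, rather than hand-written registers and state machines, tell the tool how to schedule the hardware. Whether the requested initiation interval is actually met depends on the tool and the target device.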
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Embedded Systems Design Techniques · Digital Filter Design and Implementation
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Balanced Selection · LLaMA · VirTex
