HLSTransform: Energy-Efficient Llama 2 Inference on FPGAs Via High Level Synthesis

Andy He; Darren Key; Mason Bulling; Andrew Chang; Skyler Shapiro; Everett Lee

arXiv:2405.00738 · cs.AR · May 3, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper presents HLSTransform, an FPGA-based accelerator for Llama 2 transformers that significantly reduces energy consumption and increases inference speed using high level synthesis, making FPGA deployment more accessible.

Contribution

We developed an open-source FPGA accelerator for Llama 2 transformers using high level synthesis, achieving substantial energy savings and competitive inference speeds.

Findings

01 · Up to 12.75x energy reduction compared to CPU

02 · Up to 8.25x energy reduction compared to GPU

03 · Inference speed increased by up to 2.46x over CPU

Abstract

Graphics Processing Units (GPUs) have become the leading hardware accelerator for deep learning applications and are used widely in training and inference of transformers; transformers have achieved state-of-the-art performance in many areas of machine learning and are especially used in most modern Large Language Models (LLMs). However, GPUs require large amounts of energy, which poses environmental concerns, demands high operational costs, and causes GPUs to be unsuitable for edge computing. We develop an accelerator for transformers, namely, Llama 2, an open-source state-of-the-art LLM, using high level synthesis (HLS) on Field Programmable Gate Arrays (FPGAs). HLS allows us to rapidly prototype FPGA designs without writing code at the register-transfer level (RTL). We name our method HLSTransform, and the FPGA designs we synthesize with HLS achieve up to a 12.75x reduction and 8.25x…
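To illustrate what writing an accelerator in HLS rather than RTL looks like, here is a minimal sketch of a matrix-vector product, the kind of operation that dominates transformer inference. It is not taken from the HLSTransform repository: the function name `matvec`, the dimensions `ROWS` and `COLS`, and the choice of pragma are assumptions made for illustration. The `#pragma HLS PIPELINE` directive is a Vitis HLS hint; an ordinary C++ compiler ignores it, so the same code can be checked in software before synthesis.

```cpp
#include <cstddef>

// Hypothetical dimensions, chosen only to keep the example small.
constexpr std::size_t ROWS = 4;
constexpr std::size_t COLS = 8;

// Computes out = W * x in plain C++.
// An HLS tool turns the loops into hardware; no RTL is written by hand.
void matvec(const float W[ROWS][COLS], const float x[COLS], float out[ROWS]) {
    for (std::size_t i = 0; i < ROWS; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < COLS; ++j) {
// Hint to the HLS tool: overlap successive iterations of this loop.
// The achieved initiation interval depends on the tool and the target device.
#pragma HLS PIPELINE
            acc += W[i][j] * x[j];
        }
        out[i] = acc;
    }
}
```

Because the kernel is ordinary C++, its result can be compared against a software reference on a CPU; only the pragmas and the toolchain are specific to the FPGA flow.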

Code & Models

Repositories

hlstransform/submission
pytorch · Official

Taxonomy

Topics: CCD and CMOS Imaging Sensors · Embedded Systems Design Techniques · Digital Filter Design and Implementation

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Balanced Selection · LLaMA · VirTex