Pushing the Limits of BFP on Narrow Precision LLM Inference

Hui Wang; Yuan Cheng; Xiaomeng Han; Zhengpeng Zhao; Dawei Yang; Zhe Jiang

arXiv:2502.00026·cs.AR·February 10, 2025

TL;DR

This paper introduces DB-Attn, a hardware-software co-design framework that improves the efficiency of nonlinear operations in large language models using an advanced Block Floating Point (BFP) format, achieving a reported 74% GPU speedup on LLaMA's Softmax with negligible accuracy loss.

Contribution

The paper presents DBFP, an improved BFP format, and DH-LUT, a lookup table algorithm, along with an RTL engine, to optimize nonlinear operations in LLM inference.

Findings

01. 74% GPU speedup on Softmax of LLaMA

02. 10x performance improvement over state-of-the-art designs

03. Negligible accuracy loss with the proposed methods

Abstract

The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed in inefficient floating-point formats, which makes it difficult to optimize both software efficiency and hardware overhead. In this paper, we examine the limitations and potential of applying BFP to nonlinear operations. Based on our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP version that overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and…
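For readers unfamiliar with the baseline format, the following is a minimal sketch of conventional BFP quantization, in which every value in a block shares one exponent and keeps only a short signed integer mantissa. It is not the paper's DBFP format or its pivot-focus strategy; the function names `bfp_quantize` and `bfp_dequantize`, the 8-bit mantissa width, and the choice of the block maximum as the shared exponent are assumptions made for illustration.

```python
import math

def bfp_quantize(block, mantissa_bits=8):
    """Illustrative plain BFP: one shared exponent plus signed integer mantissas.

    Assumption: the shared exponent is taken from the largest magnitude in the block.
    """
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp returns (f, e) with max_abs == f * 2**e and 0.5 <= f < 1.
    _, shared_exp = math.frexp(max_abs)
    # One bit is reserved for the sign; the remaining bits hold the magnitude.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    # Round to the nearest step and clamp to the representable range.
    mantissas = [max(-limit, min(limit, round(x / scale))) for x in block]
    return shared_exp, mantissas

def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Reconstruct approximate floats from the shared exponent and mantissas."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]

# Usage: the smallest element falls below one quantization step and becomes 0.
shared_exp, mantissas = bfp_quantize([1.0, 0.5, -0.25, 0.001])
restored = bfp_dequantize(shared_exp, mantissas)
# shared_exp == 1, mantissas == [64, 32, -16, 0]
# restored == [1.0, 0.5, -0.25, 0.0]
```

In this sketch the value 0.001 is flushed to zero because the block's largest element sets the step size. This kind of precision loss on small values within a block of widely varying magnitudes is one plausible reading of the limitation the abstract alludes to for nonlinear operations, though the paper's own analysis may differ.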

Taxonomy

Topics: Imbalanced Data Classification Techniques · Advanced Algorithms and Applications · Vehicle License Plate Recognition

Methods: Attention Is All You Need · Softmax · LLaMA