Pushing the Limits of BFP on Narrow Precision LLM Inference
Hui Wang, Yuan Cheng, Xiaomeng Han, Zhengpeng Zhao, Dawei Yang, Zhe Jiang

TL;DR
This paper introduces a hardware-software co-design framework, DB-Attn, that enhances nonlinear operation efficiency in large language models using an advanced BFP format, achieving significant speedups with minimal accuracy loss.
Contribution
The paper presents DBFP, an improved BFP format, and DH-LUT, a lookup table algorithm, along with an RTL engine, to optimize nonlinear operations in LLM inference.
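To make the lookup-table idea concrete, here is a minimal sketch of a generic table-driven exponential used inside Softmax. It is not the paper's DH-LUT, whose details this summary does not give. The names `LUT_BITS`, `lut_exp`, and `lut_softmax` and the 64-entry table size are assumptions chosen for illustration.

```python
import math

# Assumption: a 64-entry table of 2**f for f in [0, 1). A generic LUT scheme,
# NOT the paper's DH-LUT algorithm.
LUT_BITS = 6
LUT_SIZE = 1 << LUT_BITS
LUT = [2.0 ** (k / LUT_SIZE) for k in range(LUT_SIZE)]
LOG2_E = math.log2(math.e)


def lut_exp(t):
    """Approximate exp(t) as 2**i * LUT[f], where t * log2(e) = i + f."""
    y = t * LOG2_E
    i = math.floor(y)                       # integer part: becomes an exponent shift
    f = y - i                               # fractional part in [0, 1]
    idx = min(int(f * LUT_SIZE), LUT_SIZE - 1)  # truncate to a table index
    return math.ldexp(LUT[idx], i)          # LUT[idx] * 2**i


def lut_softmax(xs):
    """Softmax with max-subtraction, using the table-driven exponential."""
    m = max(xs)
    exps = [lut_exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(lut_softmax([1.0, 2.0, 3.0, -4.0]))
```

Truncating the fractional part to 6 bits bounds the relative error of each exponential by roughly 2^(1/64) - 1, about 1.1%. A larger table trades memory for accuracy.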
Findings
74% GPU speedup on Softmax of LLaMA
10x performance improvement over state-of-the-art designs
Negligible accuracy loss with the proposed methods
Abstract
The substantial computational and memory demands of Large Language Models (LLMs) hinder their deployment. Block Floating Point (BFP) has proven effective in accelerating linear operations, a cornerstone of LLM workloads. However, as sequence lengths grow, nonlinear operations, such as Attention, increasingly become performance bottlenecks due to their quadratic computational complexity. These nonlinear operations are predominantly executed in inefficient floating-point formats, which makes it difficult to optimize the system for both software efficiency and hardware overhead. In this paper, we examine the limitations and potential of applying BFP to nonlinear operations. Based on our findings, we introduce a hardware-software co-design framework (DB-Attn), including: (i) DBFP, an advanced BFP variant that overcomes nonlinear operation challenges with a pivot-focus strategy for diverse data and…
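For readers unfamiliar with the baseline format, the following is a minimal sketch of conventional BFP quantization, in which every value in a block shares one exponent and keeps only a short integer mantissa. It illustrates plain BFP, not the paper's DBFP. The function names, the 8-bit mantissa width, and the choice of the block maximum as the source of the shared exponent are assumptions made for illustration.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to one shared exponent plus integer mantissas.

    Assumption: the shared exponent is taken from the largest magnitude in the
    block, and mantissas are signed integers of `mantissa_bits` bits.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0] * len(block)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = (1 << (mantissa_bits - 1)) - 1
    # Round each value to the shared grid and clip to the mantissa range.
    mantissas = [max(-limit, min(limit, round(v / scale))) for v in block]
    return shared_exp, mantissas


def bfp_dequantize(shared_exp, mantissas, mantissa_bits=8):
    """Reconstruct floats from the shared exponent and integer mantissas."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    exp, man = bfp_quantize([0.5, -1.75, 3.0, 0.01])
    print(exp, man)                  # 2 [16, -56, 96, 0]
    print(bfp_dequantize(exp, man))  # [0.5, -1.75, 3.0, 0.0]
```

In this example the value 0.01 is flushed to zero because it shares an exponent with 3.0. This loss of small values within a block is a general property of shared-exponent formats.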
Taxonomy
Topics: Imbalanced Data Classification Techniques · Advanced Algorithms and Applications · Vehicle License Plate Recognition
Methods: Attention Is All You Need · Softmax · LLaMA
