Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format

Chao Fang; Man Shi; Robin Geens; Arne Symons; Zhongfeng Wang; Marian Verhelst

arXiv:2411.15982·cs.AR·May 13, 2025

TL;DR

Anda is an adaptive activation data format, paired with hardware optimizations, that targets the floating-point activations in weight-only quantized LLM inference. The reported gains are a 2.4x speedup, 4.0x better area efficiency, and 3.1x better energy efficiency.

Contribution

The paper proposes the Anda data type with adaptive precision, an iterative search algorithm for bit-width optimization, and hardware techniques to enhance LLM inference efficiency.

Findings

01. 2.4x speedup in FP-INT GeMM operations

02. 4.0x area efficiency improvement

03. 3.1x energy efficiency gain

Abstract

Widely used weight-only quantized large language models (LLMs), which use low-bit integer (INT) weights and retain floating-point (FP) activations, reduce storage requirements while maintaining accuracy. However, this shifts the energy and latency bottlenecks towards the FP activations, which incur costly memory accesses and computations. Existing LLM accelerators focus primarily on computation optimizations, overlooking the potential of jointly optimizing FP computations and data movement, particularly for the dominant FP-INT GeMM operations in LLM inference. To address these challenges, we investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy. Based on our findings, we first propose the Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.…
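To make "group-shared exponent bits" with a configurable mantissa length concrete, here is a minimal Python sketch of a generic block-floating-point style encoding. It is an illustration under stated assumptions, not the paper's actual bit layout or hardware scheme: the function names (`quantize_group`, `dequantize_group`), the round-to-nearest rule, and the clipping behavior are all hypothetical choices made for this example.

```python
import math


def quantize_group(values, mantissa_bits):
    """Encode one group of floats as (shared_exponent, signed integer mantissas).

    Assumption: the shared exponent is taken from the largest magnitude in the
    group, and every value is rounded to `mantissa_bits` magnitude bits
    relative to that exponent. This is a generic sketch, not Anda's layout.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0, [0] * len(values)
    # frexp gives max_abs = m * 2**shared_exp with 0.5 <= m < 1.
    _, shared_exp = math.frexp(max_abs)
    scale = 2.0 ** (mantissa_bits - shared_exp)
    limit = (1 << mantissa_bits) - 1  # largest representable magnitude
    # Round to nearest integer, then clip in case rounding overflows the limit.
    mantissas = [max(-limit, min(limit, round(v * scale))) for v in values]
    return shared_exp, mantissas


def dequantize_group(shared_exp, mantissas, mantissa_bits):
    """Decode a group back to floats using the shared exponent."""
    scale = 2.0 ** (shared_exp - mantissa_bits)
    return [m * scale for m in mantissas]


def max_error(values, mantissa_bits):
    """Largest absolute reconstruction error for one group."""
    exp, mant = quantize_group(values, mantissa_bits)
    decoded = dequantize_group(exp, mant, mantissa_bits)
    return max(abs(a - b) for a, b in zip(values, decoded))


group = [1.0, 0.5, -0.25, 0.1]
exp4, mant4 = quantize_group(group, 4)
print(exp4, mant4)                         # 1 [8, 4, -2, 1]
print(dequantize_group(exp4, mant4, 4))    # [1.0, 0.5, -0.25, 0.125]
print(max_error(group, 4), max_error(group, 8))
```

In this sketch, small values in a group lose precision because they share the exponent of the largest value, and a longer mantissa reduces that loss at the cost of more bits per element. That trade-off is what makes the mantissa length worth tuning.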

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: LLaMA · OPT · Focus