Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst

TL;DR
Anda introduces a novel adaptive data format and hardware optimizations that significantly improve the speed, area, and energy efficiency of large language model inference by better managing floating-point activations.
Contribution
The paper proposes the Anda data type with adaptive precision, an iterative search algorithm for bit-width optimization, and hardware techniques to enhance LLM inference efficiency.
Findings
2.4x speedup in FPINT GeMM operations
4.0x area efficiency improvement
3.1x energy efficiency gain
Abstract
The widely-used, weight-only quantized large language models (LLMs), which leverage low-bit integer (INT) weights and retain floating-point (FP) activations, reduce storage requirements while maintaining accuracy. However, this shifts the energy and latency bottlenecks towards the FP activations that are associated with costly memory accesses and computations. Existing LLM accelerators focus primarily on computation optimizations, overlooking the potential of jointly optimizing FP computations and data movement, particularly for the dominant FP-INT GeMM operations in LLM inference. To address these challenges, we investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy. Based on our findings, we first propose the Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Speech Recognition and Synthesis
MethodsLLaMA · OPT · Focus
