EXAQ: Exponent Aware Quantization For LLMs Acceleration
Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy

TL;DR
This paper introduces EXAQ, a method for optimizing the softmax layer in quantized LLMs, achieving sub-4-bit quantization of the softmax input and approximately 4x acceleration of the softmax accumulation phase with minimal accuracy loss.
Contribution
It proposes an analytical approach to determine optimal clipping for softmax inputs, enabling ultra-low bit quantization and significant acceleration in LLM inference.
Findings
Achieves 2-bit quantization of the softmax input while matching baseline performance on LLaMA1-30B.
Realizes approximately 4x acceleration in the softmax accumulation phase.
Attains a 36.9% overall acceleration in the softmax operation.
Abstract
Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Models (LLMs) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLMs inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLMs inference. This method accelerates the calculations of both…
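The three softmax phases named in the abstract, combined with a clipped low-bit input, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `quantized_softmax`, the clipping value `clip=-4.0`, the uniform quantization grid, and the lookup table are all assumptions made for illustration. In particular, the paper derives its clipping value analytically, whereas the value here is an arbitrary placeholder.

```python
import numpy as np

def quantized_softmax(x, clip=-4.0, bits=2):
    """Softmax over the last axis with clipped, uniformly quantized inputs.

    `clip` and `bits` are illustrative placeholders, not values from the paper.
    """
    levels = 2 ** bits
    # Shift so the row maximum is 0; every input is then <= 0.
    shifted = x - x.max(axis=-1, keepdims=True)
    # Clip to [clip, 0] and map to integer codes 0 .. levels-1.
    step = -clip / (levels - 1)
    clipped = np.clip(shifted, clip, 0.0)
    codes = np.rint((clipped - clip) / step).astype(np.int64)
    # Phase 1 (exponent): one lookup-table entry per quantization code.
    lut = np.exp(clip + step * np.arange(levels))
    exps = lut[codes]
    # Phase 2 (accumulation): sum the exponentials along the row.
    total = exps.sum(axis=-1, keepdims=True)
    # Phase 3 (normalization): divide by the accumulated sum.
    return exps / total

scores = np.array([[0.1, 2.0, -1.0, 5.0]])
probs = quantized_softmax(scores)
```

With only `2**bits` distinct codes, the exponent phase reduces to a table lookup, and each output row contains at most `2**bits` distinct probabilities. How the accumulation phase is accelerated in EXAQ is not shown in this sketch.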
Taxonomy
Topics: Medical Imaging Techniques and Applications · Distributed and Parallel Computing Systems
Methods: Softmax
