EXAQ: Exponent Aware Quantization For LLMs Acceleration

Moran Shkolnik; Maxim Fishman; Brian Chmiel; Hilla Ben-Yaacov; Ron Banner; Kfir Yehuda Levy

arXiv:2410.03185·cs.LG·October 7, 2024

TL;DR

This paper introduces EXAQ, a method for optimizing the softmax layer in quantized LLMs. It enables sub-4-bit quantization of the softmax input and approximately 4x acceleration of the softmax accumulation phase, with minimal accuracy loss.

Contribution

It proposes an analytical approach to determine optimal clipping for softmax inputs, enabling ultra-low bit quantization and significant acceleration in LLM inference.

Findings

01. Achieves 2-bit quantization with baseline performance on LLaMA1-30B.

02. Realizes approximately 4x acceleration in the softmax accumulation phase.

03. Attains a 36.9% overall acceleration in the softmax operation.

Abstract

Quantization has established itself as the primary approach for decreasing the computational and storage expenses associated with Large Language Model (LLM) inference. The majority of current research emphasizes quantizing weights and activations to enable low-bit general-matrix-multiply (GEMM) operations, with the remaining non-linear operations executed at higher precision. In our study, we discovered that following the application of these techniques, the primary bottleneck in LLM inference lies in the softmax layer. The softmax operation comprises three phases: exponent calculation, accumulation, and normalization. Our work focuses on optimizing the first two phases. We propose an analytical approach to determine the optimal clipping value for the input to the softmax function, enabling sub-4-bit quantization for LLM inference. This method accelerates the calculations of both…
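The three softmax phases named in the abstract, together with clipping and low-bit quantization of the softmax input, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `quantized_softmax`, the uniform quantization grid, and the default `clip=-4.0` are assumptions made for the example. In particular, the clipping value here is hand-picked, whereas the paper derives it analytically. The histogram-based accumulation is just one way that a small set of codes can simplify the sum.

```python
import math


def quantized_softmax(row, bits=2, clip=-4.0):
    """Softmax of one row with clipped, uniformly quantized inputs.

    `bits` and `clip` are illustrative assumptions; the clipping value is
    hand-picked here, not the analytically derived optimum.
    """
    levels = 2 ** bits
    step = -clip / (levels - 1)  # spacing of the quantization grid on [clip, 0]

    # Shift by the row maximum so every input is <= 0, clip to [clip, 0],
    # and map each value to an integer code in 0 .. levels - 1.
    top = max(row)
    codes = []
    for value in row:
        shifted = max(value - top, clip)
        codes.append(round((shifted - clip) / step))

    # Phase 1 (exponent): one table entry per code instead of one exp per element.
    lut = [math.exp(clip + step * k) for k in range(levels)]

    # Phase 2 (accumulation): count how often each code occurs, then take a
    # weighted sum over only `levels` table entries.
    counts = [0] * levels
    for code in codes:
        counts[code] += 1
    denom = sum(c * e for c, e in zip(counts, lut))

    # Phase 3 (normalization): left at full precision.
    return [lut[code] / denom for code in codes]
```

With `bits=2` there are only four possible exponent values, so the exponent phase reduces to a table lookup and the accumulation phase to four multiply-adds per row, regardless of row length. Inputs that fall exactly on the grid (for example `[0, -1, -2, -3]` with `clip=-3.0`) reproduce the exact softmax; all other inputs incur rounding and clipping error.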

Code & Models

Repositories

anonymous1252022/exaq (PyTorch, official)

Taxonomy

Topics: Medical Imaging Techniques and Applications · Distributed and Parallel Computing Systems

Methods: Softmax