OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models

Jahyun Koo; Dahoon Park; Sangwoo Jung; Jaeha Kung

arXiv:2409.05902·cs.LG·September 25, 2024

TL;DR

OPAL is a hardware-software co-designed accelerator for generative large language models. It quantizes activations in the microscaling format while preserving a few outliers per block and mixing 3-bit and 5-bit precision across layers, improving energy efficiency by 1.6 to 2.2 times and reducing area by 2.4 to 3.1 times at a cost of less than 1 perplexity.

Contribution

The paper introduces an activation quantization method that combines outlier preservation with mixed precision, along with a specialized hardware architecture that pairs FP units for the outliers with vectorized INT units for efficient LLM acceleration.

Findings

1. Energy efficiency improved by 1.6 to 2.2 times
2. Area reduced by 2.4 to 3.1 times
3. Negligible accuracy loss (<1 perplexity increase)

Abstract

To overcome the burden on memory size and bandwidth caused by the ever-increasing size of large language models (LLMs), aggressive weight quantization has recently been studied, whereas activation quantization remains under-explored. In this paper, we present a hardware-software co-design method that yields an energy-efficient LLM accelerator, named OPAL, for generation tasks. First, we propose a novel activation quantization method that leverages the microscaling data format while preserving several outliers per sub-tensor block (e.g., four out of 128 elements). Second, on top of preserving outliers, we use mixed precision, which sets inputs to sensitive layers in the decoder block of an LLM to 5-bit while keeping inputs to less sensitive layers at 3-bit. Finally, we present the OPAL hardware architecture that consists of FP units for handling outliers and vectorized INT…
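The outlier-preserving idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `quantize_block` and `dequantize_block`, the top-k-by-magnitude outlier selection, the symmetric signed integer range, and the rounding rule for the shared power-of-two scale are all assumptions made for illustration. Only the block size (128), the outlier count (four), and the 3-bit/5-bit widths come from the abstract.

```python
import math

def quantize_block(block, bits=3, num_outliers=4):
    """Quantize one block: keep the largest-magnitude values unquantized
    and share a single power-of-two scale among the remaining values.
    (Hypothetical helper; selection and scaling rules are assumptions.)"""
    # Indices sorted by magnitude, largest first; the first few are outliers.
    order = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
    outliers = {i: block[i] for i in order[:num_outliers]}
    max_abs = max((abs(block[i]) for i in order[num_outliers:]), default=0.0)

    # Symmetric signed range, e.g. [-3, 3] for 3 bits, [-15, 15] for 5 bits.
    qmax = 2 ** (bits - 1) - 1
    # Smallest power-of-two scale with max_abs / scale <= qmax (no clipping).
    exp = math.ceil(math.log2(max_abs / qmax)) if max_abs > 0 else 0
    scale = 2.0 ** exp

    codes = [
        0 if i in outliers else max(-qmax, min(qmax, round(x / scale)))
        for i, x in enumerate(block)
    ]
    return codes, exp, outliers

def dequantize_block(codes, exp, outliers):
    """Rebuild the block: outliers are restored exactly, the rest are code * scale."""
    scale = 2.0 ** exp
    return [outliers.get(i, c * scale) for i, c in enumerate(codes)]

# A 128-element block of small values with four large outliers injected.
block = [0.01 * ((i % 7) - 3) for i in range(128)]
for idx, val in ((5, 8.0), (40, -6.5), (77, 5.0), (100, -9.0)):
    block[idx] = val

codes, exp, outliers = quantize_block(block, bits=3, num_outliers=4)
recon = dequantize_block(codes, exp, outliers)
```

In this sketch, setting `num_outliers=0` on the same block forces the shared scale to cover the value 9.0, so every small element rounds to zero at 3 bits; removing the four outliers first lets the scale track the small values instead. The `bits` argument stands in for the mixed-precision choice (5 for inputs to sensitive layers, 3 otherwise).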

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Softmax