Post Training Quantization of Large Language Models with Microscaling Formats

Sayeh Sharify; Utkarsh Saxena; Zifei Xu; Wanzin Yazar; Ilya Soloveychik; Xin Wang

arXiv:2405.07135·cs.LG·October 17, 2024

TL;DR

This paper investigates post-training quantization methods for large language models, extending their applicability to microscaling formats, and demonstrates effective 4-bit weight and 8-bit activation quantization with minimal accuracy loss.

Contribution

The paper extends post-training quantization (PTQ) methods to microscaling (MX) formats and systematically analyzes how these methods interact when combined for LLM quantization.

Findings

01

Quantization to 4-bit weights and 8-bit activations in the MXINT format is achievable with negligible accuracy loss.

02

Combining PTQ methods (SmoothQuant, AWQ, GPTQ) is what enables the 4-bit weight, 8-bit activation result.

03

Extending PTQ to microscaling formats broadens the applicability of quantization techniques.

Abstract

Large Language Models (LLMs) have distinguished themselves with outstanding performance in complex language modeling tasks, yet they come with significant computational and storage challenges. This paper explores the potential of quantization to mitigate these challenges. We systematically study the combined application of three well-known post-training techniques, SmoothQuant, AWQ, and GPTQ, and provide a comprehensive analysis of their interactions and implications for advancing LLM quantization. We enhance the versatility of these methods by enabling quantization to microscaling (MX) formats, extending the applicability of these PTQ algorithms beyond their original fixed-point format targets. We show that combining different PTQ methods enables us to quantize models to 4-bit weights and 8-bit activations using the MXINT format with negligible accuracy loss compared to the…
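To make the MXINT idea in the abstract concrete, here is a minimal sketch of block-wise "fake" quantization, in which a group of values shares one power-of-two scale and each value is stored as a small signed integer. It is not the paper's implementation. The function name `mx_int_quantize`, the block size of 32, the symmetric integer range, and the rule for picking the shared exponent (the smallest power of two that avoids clipping) are all assumptions made for illustration.

```python
import numpy as np

def mx_int_quantize(x, bits=4, block_size=32):
    """Fake-quantize a 1-D array to an MXINT-like format (illustrative only).

    Assumptions (not taken from the paper): blocks of `block_size` values
    share one power-of-two scale; elements are symmetric signed integers
    in [-(2**(bits-1) - 1), 2**(bits-1) - 1].
    Returns (dequantized values, integer codes per block, shared exponents).
    """
    x = np.asarray(x, dtype=np.float64)
    pad = (-x.size) % block_size            # zero-pad to a whole number of blocks
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4 bits, 127 for 8 bits
    amax = np.max(np.abs(blocks), axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)    # avoid log2(0) for all-zero blocks

    # Shared exponent: smallest power of two such that amax / scale <= qmax.
    exp = np.ceil(np.log2(safe / qmax))
    scale = 2.0 ** exp

    q = np.clip(np.rint(blocks / scale), -qmax, qmax)
    deq = (q * scale).reshape(-1)[: x.size]  # drop the padding again
    return deq, q.astype(np.int8), exp.astype(np.int32)

# Example: 4-bit "weights" and 8-bit "activations" on toy data.
rng = np.random.default_rng(0)
w_deq, w_codes, w_exp = mx_int_quantize(rng.normal(size=128), bits=4)
a_deq, a_codes, a_exp = mx_int_quantize(rng.normal(size=128), bits=8)
```

In this sketch, the 4-bit weight and 8-bit activation setting corresponds to calling the function with `bits=4` on weights and `bits=8` on activations. Methods such as SmoothQuant, AWQ, and GPTQ would act before or around this rounding step and are not modeled here.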


Taxonomy

Topics: Natural Language Processing Techniques

Methods: OPT · LLaMA