A Comprehensive Study on Quantization Techniques for Large Language Models

Jiedong Lang; Zhehao Guo; Shuyu Huang

arXiv:2411.02530 · cs.LG · November 6, 2024 · 2 cites

Open Access

TL;DR

This paper provides a comprehensive analysis of quantization techniques to reduce the size and computational requirements of large language models, facilitating their deployment on resource-limited devices.

Contribution

It offers an in-depth review of quantization methods, their mathematical foundations, implementation details, and performance outcomes specifically for large language models.

Findings

01. Quantization significantly reduces model size and inference time.

02. Different quantization techniques vary in accuracy and efficiency trade-offs.

03. Quantization enables deployment of LLMs on resource-constrained devices.

Abstract

Large Language Models (LLMs) have been extensively researched and used in both academia and industry since the rise in popularity of the Transformer model, which demonstrates excellent performance in AI. However, the computational demands of LLMs are immense, and the energy resources required to run them are often limited. For instance, popular models like GPT-3, with 175 billion parameters and a storage requirement of 350 GB, present significant challenges for deployment on resource-constrained IoT devices and embedded systems. These systems often lack the computational capacity to handle such large models. Quantization, a technique that reduces the precision of model values to a smaller set of discrete values, offers a promising solution by reducing the size of LLMs and accelerating inference. In this research, we provide a comprehensive analysis of quantization techniques within the…
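
The storage figure in the abstract can be checked with simple arithmetic, and the idea of mapping values onto a smaller discrete set can be illustrated with a minimal sketch. The code below is not taken from the paper. It assumes the 350 GB figure corresponds to 16-bit weights (2 bytes per parameter, with 1 GB = 1e9 bytes), and it uses symmetric absmax rounding to 8-bit integers as one simple, illustrative scheme. The function names `storage_gb`, `quantize_absmax_int8`, and `dequantize` are hypothetical.

```python
def storage_gb(num_params: int, bits_per_param: int) -> float:
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single shared scale.

    Illustrative symmetric absmax scheme (an assumption, not the paper's method).
    Python's round() rounds exact halves to the nearest even integer.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Parameter count quoted in the abstract for GPT-3.
GPT3_PARAMS = 175_000_000_000

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {storage_gb(GPT3_PARAMS, bits):.1f} GB")

# Toy weight vector, made up purely for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", quantized)
print("max round-trip error:", max_error)
```

Under these assumptions, halving the bit width halves the weight storage, and the round-trip error of each value is bounded by half the scale. Real schemes typically use per-channel or per-group scales rather than one scale for the whole tensor.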


Taxonomy

Topics: Advanced Clustering Algorithms Research · Text and Document Classification Technologies

Methods: Attention Is All You Need · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Adam · Residual Connection · Weight Decay