Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Yilong Zhao; Chien-Yu Lin; Kan Zhu; Zihao Ye; Lequn Chen; Size Zheng; Luis Ceze; Arvind Krishnamurthy; Tianqi Chen; Baris Kasikci

arXiv:2310.19102 · cs.LG · April 17, 2024 · 23 cites

TL;DR

Atom introduces a low-bit quantization technique that uses the 4-bit integer operators of modern GPUs to significantly improve LLM serving throughput with minimal accuracy loss, combining mixed-precision and fine-grained quantization.

Contribution

The paper presents Atom, a novel low-bit quantization method that enhances LLM serving efficiency by utilizing 4-bit operators and mixed-precision quantization, outperforming existing schemes.
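As a rough illustration of what mixed-precision quantization can look like, the NumPy sketch below quantizes most activation channels to 4 bits and keeps a few large-magnitude "outlier" channels at 8 bits. This is a minimal sketch under stated assumptions, not Atom's implementation: the function names, the per-token scaling, the outlier selection rule, and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def fake_quantize_per_token(x, bits):
    """Symmetric quantize-then-dequantize with one scale per row (token)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def mixed_precision_quantize(acts, num_outliers=4):
    """Hypothetical scheme: the highest-magnitude channels get 8 bits, the rest 4 bits.

    Assumes 0 < num_outliers < number of channels.
    """
    channel_max = np.abs(acts).max(axis=0)
    order = np.argsort(channel_max)
    normal, outlier = order[:-num_outliers], order[-num_outliers:]
    out = np.empty_like(acts)
    out[:, normal] = fake_quantize_per_token(acts[:, normal], bits=4)
    out[:, outlier] = fake_quantize_per_token(acts[:, outlier], bits=8)
    return out, outlier

# Synthetic activations: 32 tokens x 64 channels, four channels made much larger.
rng = np.random.default_rng(0)
acts = rng.standard_normal((32, 64))
acts[:, [3, 17, 40, 59]] *= 50.0

uniform = fake_quantize_per_token(acts, bits=4)
mixed, outlier_idx = mixed_precision_quantize(acts, num_outliers=4)
uniform_err = np.mean((acts - uniform) ** 2)
mixed_err = np.mean((acts - mixed) ** 2)
print(f"uniform INT4 MSE: {uniform_err:.4f}, mixed INT4/INT8 MSE: {mixed_err:.4f}")
```

In this toy setup, separating the outlier channels stops them from inflating the scale shared by the remaining channels, which is the intuition usually given for mixed-precision schemes.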

Findings

01 Up to 7.7x throughput increase over FP16

02 2.5x throughput increase over INT8

03 Maintains accuracy with negligible loss

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably…
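To make the idea of low-bit quantization concrete, here is a minimal NumPy sketch of symmetric 4-bit quantization with one scale per small group of values, which is one common form of fine-grained quantization. The group size of 128, the function names, and the random weight matrix are assumptions for illustration only; the values are stored in `int8` containers rather than packed 4-bit storage, and nothing here reproduces Atom's GPU kernels.

```python
import numpy as np

def quantize_int4_groupwise(x, group_size=128):
    """Symmetric INT4 quantization with one float scale per group of values."""
    assert x.size % group_size == 0, "size must be divisible by group_size"
    groups = x.reshape(-1, group_size).astype(np.float32)
    # Map the largest magnitude in each group to 7, the largest positive INT4 value.
    max_abs = np.abs(groups).max(axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales, shape):
    """Recover approximate float values from integer codes and group scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 256)).astype(np.float32)

q, scales = quantize_int4_groupwise(w, group_size=128)
w_hat = dequantize(q, scales, w.shape)
max_err = np.abs(w - w_hat).max()
print(f"max abs reconstruction error: {max_err:.4f}")
```

With round-to-nearest, each reconstructed value differs from the original by at most half of its group's scale, so smaller groups tend to give smaller errors at the cost of storing more scales.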

Code & Models

Repositories

efeslab/atom
PyTorch · Official

Models

🤗 Ranjanunicode/unicode-llama-2-chat-Hf-q4-gguf
model · 14 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: travel james · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings