Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM

Luoming Zhang; Wen Fei; Weijia Wu; Yefei He; Zhenyu Lou; Hong Zhou

arXiv:2310.04836 · cs.AI · October 10, 2023 · 1 citation

TL;DR

This paper introduces Dual Grained Quantization (DGQ), a novel method for efficient low-bit quantization of large language models that balances performance and inference speed, enabling practical deployment.

Contribution

DGQ combines fine-grained and coarse-grained quantization with a two-phase grid search and percentile clipping, improving efficiency and accuracy of LLM quantization.
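The summary above names percentile clipping and a grid search over quantization scales but does not spell out either procedure. The following is a minimal generic sketch of both ideas, not the authors' algorithm: the nearest-rank percentile, the 95th-percentile choice, the 20-step grid, and all function names are assumptions made for illustration.

```python
import math

def percentile_abs(values, pct):
    """Nearest-rank percentile of |values|, with pct in (0, 100]."""
    ordered = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

def fake_quant(values, scale, qmax):
    """Quantize symmetrically to integers in [-qmax, qmax], then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def grid_search_scale(values, qmax, steps=20):
    """Try clipping ratios i/steps of max |value| and keep the lowest-MSE scale."""
    max_abs = max(abs(v) for v in values)
    best_scale, best_err = None, float("inf")
    for i in range(1, steps + 1):
        scale = (i / steps) * max_abs / qmax
        err = mse(values, fake_quant(values, scale, qmax))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err

# Hypothetical activations: a well-behaved bulk plus one large outlier.
bulk = [0.1 * i for i in range(-10, 11)]
activations = bulk + [50.0]

# Percentile clipping ignores the outlier when choosing the INT8 range.
clip_threshold = percentile_abs(activations, 95)
err_clipped = mse(bulk, fake_quant(bulk, clip_threshold / 127, 127))
err_unclipped = mse(bulk, fake_quant(bulk, max(activations) / 127, 127))

# Grid search over INT4 weight scales (qmax = 7), compared with the plain max-abs scale.
weights = [0.02, -0.15, 0.07, 0.9, -0.04, 0.11, -0.08, 0.03]
best_scale, best_err = grid_search_scale(weights, qmax=7)
baseline_err = mse(weights, fake_quant(weights, max(abs(w) for w in weights) / 7, 7))
```

In this toy setup the clipped scale represents the bulk of the activations far more precisely than a scale stretched to cover the outlier, and the grid search can never do worse than the max-abs baseline because that baseline is one of its candidates.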

Findings

01

Outperforms prior quantization methods across various LLMs and tasks.

02

Achieves 1.12x memory reduction and 3.24x speedup over an A16W4 implementation, using a custom CUTLASS kernel.

03

Enables practical deployment of low-bit LLMs in real-world scenarios.

Abstract

Large Language Models (LLMs) pose significant hardware challenges related to memory requirements and computational ability. There are two mainstream quantization schemes for LLMs: coarse-grained (e.g., channel-wise) quantization and fine-grained (e.g., group-wise) quantization. Fine-grained quantization has smaller quantization loss, consequently achieving superior performance. However, when applied to weight-activation quantization, it disrupts continuous integer matrix multiplication, leading to inefficient inference. In this paper, we introduce Dual Grained Quantization (DGQ), a novel A8W4 quantization for LLMs that maintains superior performance while ensuring fast inference speed. DGQ dequantizes the fine-grained INT4 weights into a coarse-grained INT8 representation and performs matrix multiplication using INT8 kernels. Besides, we develop a two-phase grid search…
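The core idea in the abstract, turning group-wise INT4 weights into a channel-wise INT8 representation so that a single integer matrix multiplication can be used, can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the factorization of each group scale into an integer multiplier (capped at 16 so products stay within INT8) times one floating-point channel scale, the group size of 4, and every function name here are hypothetical.

```python
def quantize_int4_group(group):
    """Symmetric INT4 quantization of one weight group: (ints in [-8, 7], float scale)."""
    max_abs = max(abs(w) for w in group)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    return [max(-8, min(7, round(w / scale))) for w in group], scale

def dual_grained_quantize(row, group_size=4, max_int_scale=16):
    """Quantize one output channel into INT4 groups, integer group scales, and a channel scale."""
    q4_groups, fine_scales = [], []
    for start in range(0, len(row), group_size):
        q, s = quantize_int4_group(row[start:start + group_size])
        q4_groups.append(q)
        fine_scales.append(s)
    # Assumption: approximate each group scale as (integer in [1, 16]) * channel_scale.
    channel_scale = max(fine_scales) / max_int_scale
    int_scales = [max(1, min(max_int_scale, round(s / channel_scale))) for s in fine_scales]
    return q4_groups, int_scales, channel_scale

def expand_to_int8(q4_groups, int_scales):
    """Integer-only 'dequantization': INT4 value * integer group scale stays within INT8."""
    w8 = []
    for q, s in zip(q4_groups, int_scales):
        w8.extend(v * s for v in q)
    assert all(-128 <= v <= 127 for v in w8)
    return w8

def quantize_activation_int8(x):
    """Symmetric INT8 quantization of one activation vector (one token)."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(v / scale))) for v in x], scale

# Hypothetical weight row (one output channel) and one activation vector.
row = [0.10, -0.35, 0.22, 0.05, 1.40, -0.95, 0.33, 0.75]
x = [0.4, -1.1, 2.0, 0.3, 1.5, -0.6, 0.0, 0.9]

q4_groups, int_scales, channel_scale = dual_grained_quantize(row)
w8 = expand_to_int8(q4_groups, int_scales)
x8, x_scale = quantize_activation_int8(x)

# One uninterrupted integer dot product, rescaled once at the end.
int_acc = sum(w * a for w, a in zip(w8, x8))
approx = int_acc * channel_scale * x_scale
exact = sum(w * a for w, a in zip(row, x))
```

Because the per-group scales are folded into the INT8 weights before the dot product, the accumulation never has to pause for a floating-point rescale between groups; the only floating-point work is the final multiplication by the channel and activation scales.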

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 5: marginally below the acceptance threshold · Confidence 4

Strengths

+ The paper is well-organized and easy to follow.
+ The proposed percentile clipping smoothing is very interesting. It combines both clipping and smoothing into one smooth scale.
+ The results of W4A8 with per-token activation quantization without group quantization are impressive.
+ Evaluation experiments on real GPU machines with well-implemented kernels look very solid, and the measured runtime and memory usage are very promising.

Weaknesses

+ The paper writing is not clear enough. Many details in the proposed techniques are missing, such as the granularity of activation quantization, and the calibration dataset. Please see the questions below.
+ The novelty of dual-level quantization is limited. Using two-level scaling factors (UINT4 for group scaling and FP for channel scaling) in quantization was first proposed in VSQuant and has been used in many other works, including QLoRA.
+ The novelty of the proposed two-phase search for th

Reviewer 02 · Rating 3: reject, not good enough · Confidence 4

Strengths

- This paper combines two types of well-known, current quantization approaches.
- This paper shows comparable and extensive results on LLaMA and OPT models.

Weaknesses

1) Concerns Regarding Acceleration Results: The foundation of the presented method appears to lie in its kernel design and the W4A8 dequantization results. Typically, quantization techniques are employed to enhance latency and throughput. Yet, there are instances where quantization may not favorably influence acceleration outcomes. For instance, LLM.int8() doesn't seem to offer significantly better kernel results based on its decomposition method. When examining larger batch sizes, both LUT-GEM

Reviewer 03 · Rating 3: reject, not good enough · Confidence 3

Strengths

* W4A8 quantization solution for LLM inference is a practical choice. The overall method is practical in my opinion.
* The paper experiments with a kernel implementation.
* The paper compares with both weight-only and weight-activation quantization baselines.

Weaknesses

My major concern is that the paper's writing is hard to follow, with vague statements and logic. This leaves me with many doubts after reading the paper; see the questions section. Moreover, since the major contribution of this paper is a dual-grained format design that is friendlier to kernel implementation, providing the kernel implementation and showing detailed GPU profiling would be helpful.

Reviewer 04 · Rating 3: reject, not good enough · Confidence 5

Strengths

+ The proposed DGQ can accelerate the model on general-purpose hardware and avoids the need to design specific hardware.
+ How to effectively quantize LLMs is important. The main idea in this paper is easy to follow and looks reasonable.
+ The two-stage grid search seems to work well in the experiments.

Weaknesses

- The novelty of this paper is poor. DGQ is incremental over existing quantization methods. It is extremely unclear why the proposed approach is suited to LLMs, as it can also be used to quantize other models. The authors should analyze LLMs and provide the motivation that makes the proposed DGQ unique to LLMs, as AWQ does.
- Fig. 1 looks unfair to other methods. I think if you set other methods as the same A8W4, the memory usage should be similar. Hence, this comparison canno

Taxonomy

Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings