Channel-Wise Mixed-Precision Quantization for Large Language Models
Zihan Chen, Bike Xie, Jundong Li, Cong Shen

TL;DR
This paper introduces CMPQ, a channel-wise mixed-precision quantization method for large language models that adapts precision levels per channel to reduce memory usage while maintaining performance.
Contribution
The paper proposes a novel mixed-precision quantization approach that allocates precision per channel based on activation distributions, improving adaptability and performance of LLMs on edge devices.
Findings
CMPQ improves quantization performance across different LLM sizes.
CMPQ achieves significant performance gains with modest memory increase.
The method effectively preserves critical information during quantization.
Abstract
Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ…
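The idea of meeting a fractional average bit-width by giving different weight channels different precisions can be illustrated with a small sketch. This is not the authors' algorithm: the ranking signal (a per-channel activation norm), the function names `allocate_bits` and `quantize_channel`, and the choice of plain min-max uniform quantization are all assumptions made for illustration only.

```python
import math

def allocate_bits(act_norms, target_avg_bits):
    """Hypothetical allocation rule: every channel gets floor(target) bits,
    then the channels with the largest activation norms get one extra bit
    until the average bit budget is used up."""
    n = len(act_norms)
    base = math.floor(target_avg_bits)
    # Number of channels that can receive one extra bit within the budget
    # (small epsilon guards against floating-point underestimation).
    n_high = int(math.floor((target_avg_bits - base) * n + 1e-9))
    order = sorted(range(n), key=lambda i: act_norms[i], reverse=True)
    bits = [base] * n
    for i in order[:n_high]:
        bits[i] = base + 1
    return bits

def quantize_channel(weights, bits):
    """Min-max uniform quantization of one channel (an assumed quantizer).
    Returns the de-quantized values so the rounding error can be inspected."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    levels = (1 << bits) - 1          # number of quantization steps
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in weights]

# Five channels, average budget of 3.4 bits per weight.
norms = [0.5, 9.0, 1.2, 0.3, 4.0]
bits = allocate_bits(norms, 3.4)
print(bits)                    # [3, 4, 3, 3, 4]
print(sum(bits) / len(bits))   # 3.4
```

In this toy setup the two channels with the largest activation norms receive 4 bits and the rest receive 3, so the average lands exactly on the 3.4-bit budget, which an integer-only scheme could not use fully.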
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
The proposed method shows better performance over GPTQ/AWQ, etc., on various large language models.
1. The contribution of this paper is relatively minor and could be considered essentially an extended version of SmoothQuant to a limited extent. 2. By comparing Tables 1 and 2, the performance improvements over the baseline are not stable and, in some cases, are worse. For example, under 4-bit precision on the C4 dataset for LLaMA2-7B, compared with QuIP (7.10 vs. 6.69), more discussion is required. 3. The ablation results show that w/o Oact has little effect, indicating that output activa
The motivation is compelling, as Figure 2 clearly illustrates that different channels, particularly those associated with tokens exhibiting massive activation, have varying impacts on overall accuracy. This suggests that using uniform bit allocations across all channels could be highly inefficient. To address this, the authors incorporate established outlier detection techniques to manage these outlier variations effectively. Their results show that only a few channels require different bit allo
Unfortunately, this manuscript faces several critical issues that limit its practicality. Here are some examples to illustrate these concerns: - The manuscript does not address how to accelerate the proposed quantization method effectively. Since the bit allocations vary across different channels, parallel operations that process multiple channels simultaneously could introduce significant performance bottlenecks, requiring special handling to mitigate these issues. - The selection of outlier
One of the strengths of this paper is its focused and in-depth exploration of outliers in quantization. Unlike many previous methods, which either overlook or handle outliers with basic techniques, CMPQ presents a comprehensive approach by categorizing outliers into activation-based and quantization-sensitive types. This dual approach is a significant step forward, as it ensures that high-impact weights are preserved where they matter most, leading to enhanced performance stability across differ
1. While CMPQ leverages mixed-precision quantization to improve model performance, the paper lacks a rigorous theoretical framework for determining the allocation of different bit widths. Mixed-precision inherently introduces challenges in establishing precise criteria and policies for bit-width allocation across channels, which often leads to overly practical and heuristic-based solutions. Without a solid theoretical foundation, the method risks being overly tailored to specific scenarios, limi
1. The channel-wise mixed-precision quantization can help deal with channel outliers. 2. The experimental results indicate substantial improvements in accuracy after quantization.
1. There are typography disorders on page 7. 2. The overall process of CMPQ during inference is not well elaborated. For example, how do you perform de-quantization efficiently? 3. The real deployment efficiency is not presented. Since CMPQ sets different precision for each channel, I am worried about its de-quantization overhead during inference.
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Focus
