SplitQuant: Layer Splitting for Low-Bit Neural Network Quantization
Jaewoo Song, Fangzhen Lin

TL;DR
SplitQuant is a novel layer splitting method that improves low-bit neural network quantization by better handling outliers, leading to higher accuracy in quantized models like BERT-Tiny.
Contribution
The paper introduces SplitQuant, a new layer splitting technique that preserves outliers and enhances quantization resolution for low-bit neural networks.
Findings
Improved INT2 quantization accuracy by up to 3.3 percentage points.
Achieved quantized model accuracy comparable to FP32 models.
Effective on BERT-Tiny models with minimal accuracy loss.
Abstract
Quantization for deep neural networks (DNNs) maps parameter values from their original data types to lower-precision data types, reducing model size and speeding up inference. Because the range of the original values is larger than the range of the quantized values, quantization often maps different original values to a single quantized value, which degrades the accuracy of the quantized DNN. Outliers are a main cause of this loss of quantization resolution because they enlarge the range of the original values. The percentile method is often used to clip outliers, but clipping removes important, strong signals in the DNN. This paper proposes SplitQuant to keep the outliers and improve the quantization resolution at the same time. SplitQuant narrows down the…
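The resolution problem described in the abstract can be illustrated with a small sketch. It is not the SplitQuant method, only a toy demonstration of why one outlier hurts low-bit quantization and what percentile clipping gives up. The helper names (`quantize`, `dequantize`, `percentile`), the five sample weights, and the 75th-percentile threshold are all assumptions chosen for illustration; the code uses plain uniform affine quantization to 2 bits.

```python
def quantize(values, bits, hi=None):
    """Uniform affine quantization to 2**bits levels over [min(values), hi].

    If `hi` is given, values above it are clipped first (hypothetical
    stand-in for percentile clipping).
    """
    lo = min(values)
    hi = max(values) if hi is None else hi
    levels = 2 ** bits - 1              # INT2 -> codes 0..3
    scale = (hi - lo) / levels
    codes = [round((min(v, hi) - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


def percentile(values, p):
    """Nearest-rank percentile; a simple illustrative helper."""
    ordered = sorted(values)
    return ordered[round(p / 100 * (len(ordered) - 1))]


# Toy weights: four ordinary values and one outlier (assumed data).
weights = [-0.75, -0.25, 0.25, 0.75, 16.0]

# 1) Quantize over the full range: the outlier stretches the range,
#    so all four ordinary weights collapse onto the same code.
full_codes, full_scale, full_lo = quantize(weights, bits=2)

# 2) Clip at a percentile threshold: the ordinary weights now get
#    distinct codes, but the outlier is flattened to the threshold.
threshold = percentile(weights, 75)
clip_codes, clip_scale, clip_lo = quantize(weights, bits=2, hi=threshold)
clip_restored = dequantize(clip_codes, clip_scale, clip_lo)

print(full_codes)         # [0, 0, 0, 0, 3]
print(clip_codes)         # [0, 1, 2, 3, 3]
print(clip_restored[-1])  # 0.75, the outlier's magnitude is lost
```

In the first case the outlier survives but the ordinary weights become indistinguishable; in the second the ordinary weights are resolved but the strong signal is removed. This is the trade-off the abstract says SplitQuant aims to avoid.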
Taxonomy
Topics: Brain Tumor Detection and Classification · Neural Networks and Applications
Methods: Contrastive Language-Image Pre-training
