TL;DR
Quantune is an auto-tuning framework that uses gradient boosting to efficiently optimize post-training quantization configurations for CNNs, significantly reducing search time while maintaining high accuracy.
Contribution
The paper introduces Quantune, a gradient boosting-based auto-tuner that accelerates the search for optimal quantization settings and outperforms traditional search methods in both speed and accuracy.
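As a rough illustration of how a gradient boosting model can guide a configuration search, the sketch below measures a few configurations, fits a small boosted model on the results, and then measures whichever untested configuration the model scores highest. Everything here is an assumption made for illustration: the option names in `SPACE`, the made-up `toy_accuracy` function (a stand-in for quantizing a real model and evaluating it), and the hand-rolled stump booster. It is not Quantune's actual implementation, features, or cost model.

```python
import itertools
import random

# Hypothetical search space; option names are illustrative only.
SPACE = {
    "calibration": ["minmax", "percentile", "kl"],
    "scheme": ["symmetric", "asymmetric"],
    "clipping": ["none", "max"],
    "granularity": ["tensor", "channel"],
}
OPTIONS = list(SPACE.values())
CONFIGS = list(itertools.product(*OPTIONS))


def encode(cfg):
    """One-hot encode a configuration tuple."""
    return [1.0 if cfg[i] == opt else 0.0
            for i, opts in enumerate(OPTIONS) for opt in opts]


def toy_accuracy(cfg):
    """Stand-in for quantizing a model and measuring accuracy (made-up numbers)."""
    bonus = {"percentile": 1.0, "kl": 1.5, "asymmetric": 0.8, "max": 0.4, "channel": 2.0}
    return 70.0 + sum(bonus.get(opt, 0.0) for opt in cfg)


def fit_boosted_stumps(X, y, rounds=30, lr=0.3):
    """Gradient boosting for squared loss, using one-feature stumps as weak learners."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [t - p for t, p in zip(y, pred)]  # negative gradient of squared loss
        best = None
        for j in range(len(X[0])):
            on = [r for x, r in zip(X, resid) if x[j] == 1.0]
            off = [r for x, r in zip(X, resid) if x[j] == 0.0]
            if not on or not off:
                continue
            m_on, m_off = sum(on) / len(on), sum(off) / len(off)
            sse = sum((r - m_on) ** 2 for r in on) + sum((r - m_off) ** 2 for r in off)
            if best is None or sse < best[0]:
                best = (sse, j, m_on, m_off)
        if best is None:
            break
        _, j, m_on, m_off = best
        stumps.append((j, lr * m_on, lr * m_off))
        pred = [p + (lr * m_on if x[j] == 1.0 else lr * m_off) for x, p in zip(X, pred)]

    def predict(x):
        return base + sum(a if x[j] == 1.0 else b for j, a, b in stumps)

    return predict


def search(budget=10, warmup=4, seed=0):
    """Measure `warmup` random configs, then let the model choose the rest."""
    rng = random.Random(seed)
    measured = {cfg: toy_accuracy(cfg) for cfg in rng.sample(CONFIGS, warmup)}
    while len(measured) < budget:
        predict = fit_boosted_stumps([encode(c) for c in measured], list(measured.values()))
        candidates = [c for c in CONFIGS if c not in measured]
        nxt = max(candidates, key=lambda c: predict(encode(c)))
        measured[nxt] = toy_accuracy(nxt)
    return max(measured, key=measured.get), measured


best, measured = search()
print(best, measured[best], f"({len(measured)} of {len(CONFIGS)} configs measured)")
```

The point of the design is that each expensive measurement is spent where the model expects the highest accuracy, so only a fraction of the space is evaluated. A real tuner would replace `toy_accuracy` with an actual quantize-and-evaluate step.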
Findings
Reduces quantization search time by approximately 36.5x.
Keeps accuracy loss between 0.07% and 0.65%.
Effective across diverse CNN models, including ones that are fragile under quantization.
Abstract
To adopt convolutional neural networks (CNNs) for a range of resource-constrained targets, it is necessary to compress the CNN models by performing quantization, whereby a higher-precision representation is converted to a lower-bit representation. To overcome problems such as sensitivity of the training dataset, high computational requirements, and large time consumption, post-training quantization methods that do not require retraining have been proposed. In addition, to compensate for the accuracy drop without retraining, previous studies on post-training quantization have proposed several complementary methods: calibration, schemes, clipping, granularity, and mixed-precision. To generate a quantized model with minimal error, it is necessary to study all possible combinations of the methods because each of them is complementary and the CNN models have different characteristics. However, an…
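To make the idea of combining quantization methods concrete, here is a minimal sketch that quantizes a toy tensor under every combination of bit width, scheme, and clipping threshold, then ranks the combinations by reconstruction error. The `quantize` helper, the toy `weights`, and the option values are assumptions for illustration. They cover only a subset of the methods the abstract lists and are not the paper's configuration space.

```python
import itertools


def quantize(values, bits=8, scheme="symmetric", clip=None):
    """Uniform quantize-then-dequantize of a list of floats."""
    # The represented range always includes zero so that 0.0 stays exact.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    if clip is not None:  # clip is assumed to be a positive threshold
        lo, hi = max(lo, -clip), min(hi, clip)
    if scheme == "symmetric":
        qmin, qmax = -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1
        scale = (max(abs(lo), abs(hi)) or 1.0) / qmax
        zero_point = 0
    else:  # asymmetric (affine) scheme with an integer zero point
        qmin, qmax = 0, 2 ** bits - 1
        scale = ((hi - lo) or 1.0) / qmax
        zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = min(qmax, max(qmin, round(v / scale) + zero_point))  # saturate
        out.append((q - zero_point) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


weights = [-1.9, -0.4, -0.05, 0.0, 0.02, 0.3, 0.7, 6.0]  # toy tensor with one outlier
space = {"bits": [4, 8], "scheme": ["symmetric", "asymmetric"], "clip": [None, 2.0]}

results = []
for bits, scheme, clip in itertools.product(*space.values()):
    err = mse(weights, quantize(weights, bits, scheme, clip))
    results.append(((bits, scheme, clip), err))

for cfg, err in sorted(results, key=lambda r: r[1]):
    print(cfg, f"{err:.5f}")
```

Even this small space has 8 combinations. Adding calibration, granularity, and per-layer precision choices multiplies the count, which is why exhaustive evaluation on a real model quickly becomes expensive.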
Taxonomy
Methods: Average Pooling · Max Pooling · Softmax · Global Average Pooling · Residual Connection · Fire Module · Dropout · Xavier Initialization · Convolution
