QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications
Jeongseok Kim, Jemin Lee, Yongin Kwon, Daeyoung Kim

TL;DR
QuantuneV2 is a compiler-based mixed-precision quantization method that reduces runtime overhead and improves accuracy for embedded AI applications by performing inference only twice and optimizing the compilation process.
Contribution
It introduces a novel compiler-level approach for mixed-precision quantization that minimizes computational overhead and enhances model accuracy without retraining.
Findings
Achieved up to 10.28% accuracy improvement
Realized 12.52% speed increase over existing methods
Validated on five different neural network models
Abstract
Mixed-precision quantization methods have been proposed to reduce model size while minimizing accuracy degradation. However, existing studies require retraining and do not consider the computational overhead and intermediate representations (IR) generated during the compilation process, limiting their application at the compiler level. This computational overhead refers to the runtime latency caused by frequent quantization and dequantization operations during inference. Performing these operations at the individual operator level causes significant runtime delays. To address these issues, we propose QuantuneV2, a compiler-based mixed-precision quantization method designed for practical embedded AI applications. QuantuneV2 performs inference only twice, once before quantization and once after quantization, and operates with a computational complexity of O(n) that increases linearly with…
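The two-inference idea in the abstract can be illustrated with a small sketch. This is not the QuantuneV2 implementation: the abstract is truncated before it names the local metric, so the per-layer SQNR score, the toy elementwise layers, the `min_sqnr_db` threshold, and every function name below are assumptions chosen only for illustration. The sketch runs one FP32 pass and one fake-quantized pass, scores each layer once (linear in the number of layers), and counts how many precision switches a given plan would introduce, since each switch implies a quantize or dequantize step at runtime.

```python
import math

def fake_quantize(values, num_bits=8):
    """Symmetric uniform quantize-then-dequantize (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / (2 ** (num_bits - 1) - 1)
    return [round(v / scale) * scale for v in values]

def forward(layers, x, quantized):
    """One inference pass; returns the activations of every layer."""
    activations = []
    for weights in layers:
        w = fake_quantize(weights) if quantized else weights
        # Toy layer (hypothetical): elementwise multiply followed by ReLU.
        x = [max(0.0, wi * xi) for wi, xi in zip(w, x)]
        if quantized:
            x = fake_quantize(x)
        activations.append(x)
    return activations

def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB (assumed local metric)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0.0:
        return math.inf
    if signal == 0.0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)

def choose_precisions(layers, x, min_sqnr_db):
    """Two passes, then a single O(n) sweep over the layers."""
    fp32_acts = forward(layers, x, quantized=False)  # inference 1: before quantization
    int8_acts = forward(layers, x, quantized=True)   # inference 2: after quantization
    scores = [sqnr_db(f, q) for f, q in zip(fp32_acts, int8_acts)]
    # Layers whose score falls below the threshold stay in FP32.
    plan = ["fp32" if s < min_sqnr_db else "int8" for s in scores]
    return scores, plan

def count_precision_switches(plan):
    """Each change of precision between adjacent layers needs a Q or DQ op."""
    return sum(1 for a, b in zip(plan, plan[1:]) if a != b)
```

Calling `choose_precisions(layers, x, min_sqnr_db=40.0)` on a list of weight vectors returns the per-layer scores and a precision plan; passing that plan to `count_precision_switches` gives a rough proxy for the quantize/dequantize overhead the abstract describes. A threshold is used instead of a top-k ranking so that the selection step stays linear in the number of layers.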
