QuantuneV2: Compiler-Based Local Metric-Driven Mixed Precision Quantization for Practical Embedded AI Applications

Jeongseok Kim; Jemin Lee; Yongin Kwon; Daeyoung Kim

arXiv:2501.07161·cs.AI·January 14, 2025

TL;DR

QuantuneV2 is a compiler-based mixed-precision quantization method that reduces runtime overhead and improves accuracy for embedded AI applications by performing inference only twice and optimizing the compilation process.

Contribution

It introduces a novel compiler-level approach for mixed-precision quantization that minimizes computational overhead and enhances model accuracy without retraining.

Findings

01 Achieved up to 10.28% accuracy improvement

02 Realized 12.52% speed increase over existing methods

03 Validated on five different neural network models

Abstract

Mixed-precision quantization methods have been proposed to reduce model size while minimizing accuracy degradation. However, existing studies require retraining and do not consider the computational overhead and intermediate representations (IR) generated during the compilation process, limiting their application at the compiler level. This computational overhead refers to the runtime latency caused by frequent quantization and dequantization operations during inference. Performing these operations at the individual operator level causes significant runtime delays. To address these issues, we propose QuantuneV2, a compiler-based mixed-precision quantization method designed for practical embedded AI applications. QuantuneV2 performs inference only twice, once before quantization and once after quantization, and operates with a computational complexity of O(n) that increases linearly with…
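The two-pass idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the excerpt does not state which local metric QuantuneV2 uses, so per-layer mean squared error stands in for it here, and the toy dense layers, the symmetric uniform quantizer, and all function names (`quantize`, `run_model`, `local_sensitivity`, `choose_high_precision`) are hypothetical. The sketch runs one full-precision pass and one quantized pass, compares the outputs layer by layer, and keeps the most sensitive layers in higher precision.

```python
def quantize(values, bits=8):
    """Symmetric uniform fake-quantization: snap to a grid, then dequantize."""
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def layer_forward(weights, x):
    """Toy dense layer followed by ReLU; `weights` is a list of rows."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def run_model(layers, x, bits=None):
    """One inference pass that records every layer's output.

    With bits=None the weights stay in full precision; otherwise each
    row of weights is fake-quantized to `bits` bits first.
    """
    outputs = []
    for weights in layers:
        if bits is not None:
            weights = [quantize(row, bits) for row in weights]
        x = layer_forward(weights, x)
        outputs.append(x)
    return outputs


def local_sensitivity(fp_outputs, q_outputs):
    """Per-layer mean squared error between the two passes (assumed metric)."""
    return [
        sum((a - b) ** 2 for a, b in zip(fp, q)) / len(fp)
        for fp, q in zip(fp_outputs, q_outputs)
    ]


def choose_high_precision(sensitivity, keep):
    """Indices of the `keep` most sensitive layers, in layer order."""
    ranked = sorted(range(len(sensitivity)), key=lambda i: sensitivity[i], reverse=True)
    return sorted(ranked[:keep])


# Toy three-layer model and one calibration sample (made-up numbers).
layers = [
    [[0.8, -0.3], [0.1, 0.5]],
    [[-0.6, 0.9], [0.4, 0.2]],
    [[0.7, 0.05], [-0.2, 0.3]],
]
sample = [1.0, 0.5]

fp_outputs = run_model(layers, sample)          # pass 1: before quantization
q_outputs = run_model(layers, sample, bits=4)   # pass 2: after quantization
scores = local_sensitivity(fp_outputs, q_outputs)
print(choose_high_precision(scores, keep=1))
```

The number of inference passes stays at two regardless of how many layers the model has, which is the property the abstract emphasizes. The ranking step in this sketch uses a sort, so it should not be read as a statement about the complexity of the actual method.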
