Analyzing Quantization in TVM

Mingfei Guo

arXiv:2308.10905 · cs.LG · August 23, 2023

TL;DR

This paper investigates why 8-bit quantization in TVM underperforms and proposes optimization strategies, including bug fixes, leading to significant inference time improvements for both compute-bound and memory-bound tasks.

Contribution

The paper identifies performance issues in TVM's 8-bit quantization, implements bug fixes, and evaluates optimization techniques to significantly enhance inference speed.

Findings

01

8-bit quantization can be optimized to outperform the baseline by over 160% in inference time

02

Performance issues were traced to a bug in graph building

03

Optimization strategies yield nearly 195% speedup in memory-bound tasks

Abstract

Many papers in the academic literature address quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM can also quantize weights and supports low-bit computations. Although quantization is typically expected to improve inference time, the performance of 8-bit quantization in TVM does not meet expectations. Applying 8-bit quantization to a deep learning model is usually expected to achieve around 50% of the full-precision inference time. However, in this particular case, not only does the quantized version fail to achieve the desired performance boost, but it actually performs worse, with an inference time about twice as long as that of the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and…
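To make the idea of 8-bit weight quantization concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration under stated assumptions, not TVM's implementation or the paper's method: the function names, the single shared scale, and the sample weights are all hypothetical. It shows where the 4x storage reduction comes from and how large the rounding error can get.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization.
# This is NOT TVM's quantization pass; all names are hypothetical.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale would work.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [x * scale for x in q]


# Hypothetical weight values chosen for the example.
weights = [0.5, -1.27, 0.003, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize_int8(q_weights, scale)

# Storage per element: 4 bytes for float32 versus 1 byte for int8.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The smaller footprint explains the memory savings, but it does not by itself guarantee faster inference: the speedup depends on whether the compiled kernels actually execute in low precision, which is the question the abstract raises.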

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Tensor decomposition and applications