
TL;DR
This paper investigates why 8-bit quantization in TVM underperforms and proposes optimization strategies, including bug fixes, leading to significant inference time improvements for both compute-bound and memory-bound tasks.
Contribution
The paper identifies performance issues in TVM's 8-bit quantization, implements bug fixes, and evaluates optimization techniques to significantly enhance inference speed.
Findings
8-bit quantization can be optimized to outperform the baseline by over 160%
Performance issues were traced to a bug in graph building
Optimization strategies yield nearly 195% speedup in memory-bound tasks
Abstract
Many papers in the academic literature address quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM can also quantize weights and supports low-bit computation. Although quantization is typically expected to improve inference time, the performance of 8-bit quantization in TVM does not meet expectations. Applying 8-bit quantization to a deep learning model is usually expected to bring inference time down to around 50% of the full-precision time. In this case, however, the quantized version not only fails to deliver that speedup but actually performs worse, with an inference time about twice that of the non-quantized version. In this project, we thoroughly investigate the reasons behind the underperformance and assess the compatibility and…
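As a rough illustration of what 8-bit weight quantization involves, the sketch below applies symmetric per-tensor quantization to a small list of weights in plain Python. It is not TVM's implementation or the paper's code: the function names `quantize_int8` and `dequantize`, the sample weights, and the choice of a symmetric scheme are all assumptions made for illustration.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization (illustrative, not TVM's scheme).

    Maps floats to integers in [-127, 127] using a single scale factor.
    """
    max_abs = max(abs(w) for w in weights)
    # Fall back to a scale of 1.0 for an all-zero tensor to avoid dividing by zero.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the int8 codes."""
    return [v * scale for v in q]


# Hypothetical weights chosen only for the example.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage cost: 4 bytes per float32 value versus 1 byte per int8 value.
fp32_bytes = len(weights) * array("f").itemsize
int8_bytes = len(q) * array("b").itemsize
```

The int8 copy needs a quarter of the storage of the float32 original, and each restored value differs from its source by at most half the scale. Whether that smaller footprint turns into faster inference depends on how the compiler schedules the low-bit operators, which is the question the paper examines for TVM.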
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Tensor decomposition and applications
