UNIT: Unifying Tensorized Instruction Compilation
Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, and Tony Nowatzki

TL;DR
This paper introduces UNIT, a unified compiler framework that simplifies and automates the utilization of tensorized instructions across different hardware platforms, significantly improving DNN inference performance.
Contribution
We develop a unified compiler framework that abstracts and automates the compilation of tensorized instructions from multiple hardware vendors, enabling easier integration and optimization.
Findings
Achieves 1.3x speedup over Intel oneDNN on x86 CPU
Achieves 1.75x speedup over Nvidia cuDNN on Nvidia GPU
Achieves 1.13x speedup over tuned TVM on ARM CPU
Abstract
Because of the increasing demand for computation in DNNs, researchers develop both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to leverage mixed precision without hardware support because of the overhead of data casting. Hardware vendors offer tensorized instructions for mixed-precision tensor operations, like Intel VNNI, Tensor Core, and ARM-DOT. These instructions involve a computing idiom that reduces multiple low precision elements into one high precision element. The lack of compilation techniques for this idiom makes it hard to utilize these instructions: using vendor-provided libraries for computationally-intensive kernels is inflexible and prevents further optimizations, and manually writing hardware intrinsics is error-prone and difficult for programmers. Some prior works…
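The idiom the abstract describes, reducing several low precision elements into one high precision element, can be illustrated with a small scalar emulation. The sketch below is not taken from the paper or from UNIT. It assumes a VNNI-style layout in which each output lane sums four products of an unsigned 8-bit value and a signed 8-bit value into a 32-bit accumulator. The function name `dot4_accumulate` is hypothetical, and 32-bit overflow behavior is ignored for simplicity.

```python
def dot4_accumulate(acc, a, b):
    """Scalar emulation of a mixed-precision dot-product idiom (illustrative only).

    acc: list of wide (32-bit style) accumulators, one per output lane
    a:   list of unsigned 8-bit values, four per lane
    b:   list of signed 8-bit values, four per lane
    Returns a new list of accumulators; inputs are not modified.
    """
    lanes = len(acc)
    if len(a) != 4 * lanes or len(b) != 4 * lanes:
        raise ValueError("a and b must each hold four elements per lane")
    if any(not 0 <= x <= 255 for x in a):
        raise ValueError("a must contain unsigned 8-bit values")
    if any(not -128 <= x <= 127 for x in b):
        raise ValueError("b must contain signed 8-bit values")

    out = []
    for lane in range(lanes):
        total = acc[lane]
        # Reduce four low precision products into one high precision element.
        for k in range(4):
            total += a[4 * lane + k] * b[4 * lane + k]
        out.append(total)
    return out


result = dot4_accumulate(
    [0, 10],
    [1, 2, 3, 4, 255, 255, 255, 255],
    [1, 1, 1, 1, -128, -128, -128, -128],
)
print(result)
```

A hardware tensorized instruction performs this reduction for all lanes at once. The compilation problem is recognizing where loops in a tensor program match this pattern and rewriting them to use the instruction.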
