NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques

Wenlei Bao; Li-Wen Chang; Yang Chen; Ke Deng; Amit Agarwal; Emad Barsoum; Abe Taha

arXiv:1910.00178·cs.LG·November 15, 2019·1 cites

TL;DR

NGEMM is a compiler-based optimized GEMM implementation that significantly accelerates low-precision matrix multiplication for deep learning, outperforming existing libraries and enhancing production service efficiency.

Contribution

This paper introduces NGEMM, a novel compiler-based approach that improves integer GEMM performance for quantized DNNs by optimizing vector unit utilization and reducing unnecessary computations.

Findings

1. NGEMM outperforms MKL non-pack by 1.86x on average.

2. NGEMM outperforms MKL pack by 1.16x on average.

3. NGEMM has been deployed in Microsoft production services.

Abstract

Quantization has emerged as an effective way to significantly boost the performance of deep neural networks (DNNs) by utilizing low-bit computations. Despite their lower numerical precision, quantized DNNs reduce both memory bandwidth and computation cycles with little loss of accuracy. Integer GEMM (General Matrix Multiplication) is critical to running quantized DNN models efficiently, as GEMM operations often dominate the computations in these models. Various approaches have been developed that leverage techniques such as vectorization and memory layout to improve the performance of integer GEMM. However, these existing approaches are not fast enough in certain scenarios. We developed NGEMM, a compiler-based GEMM implementation for accelerating lower-precision training and inference. NGEMM makes better use of the vector units by avoiding unnecessary vector computation…
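To make the term "integer GEMM" concrete, here is a minimal reference sketch in plain Python. It is not NGEMM and contains none of the paper's compiler or vectorization techniques; it only illustrates the arithmetic a quantized GEMM performs. The symmetric per-tensor int8 scheme and the function names `quantize`, `int_gemm`, and `dequantize` are assumptions made for illustration.

```python
def quantize(values, scale):
    """Map floats to the int8 range [-127, 127].

    Assumed scheme: symmetric, per-tensor quantization with one scale.
    """
    return [[max(-127, min(127, round(v / scale))) for v in row]
            for row in values]


def int_gemm(a_q, b_q):
    """Naive integer GEMM: C[i][j] = sum over p of A[i][p] * B[p][j].

    Inputs are int8-range integers; products are accumulated in a wider
    integer (hardware kernels typically use int32 for this).
    """
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                acc += a_q[i][p] * b_q[p][j]
            c[i][j] = acc
    return c


def dequantize(c_q, scale_a, scale_b):
    """Rescale the integer accumulators back to floating point."""
    return [[v * scale_a * scale_b for v in row] for row in c_q]


if __name__ == "__main__":
    a = [[1.0, -2.0], [0.5, 0.25]]
    b = [[0.5, 1.0], [-1.0, 0.25]]
    scale_a, scale_b = 2.0 / 127, 1.0 / 127  # max |value| / 127 per tensor
    c_q = int_gemm(quantize(a, scale_a), quantize(b, scale_b))
    print(dequantize(c_q, scale_a, scale_b))
```

The triple loop is the part that optimized libraries replace with vectorized, layout-aware kernels; the quantize and dequantize steps stay outside the GEMM itself.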

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Machine Learning and Algorithms