Characterizing and Demystifying the Implicit Convolution Algorithm on Commercial Matrix-Multiplication Accelerators

Yangjie Zhou; Mengtian Yang; Cong Guo; Jingwen Leng; Yun Liang; Quan Chen; Minyi Guo; Yuhao Zhu

arXiv:2110.03901·cs.DC·October 11, 2021·5 cites

TL;DR

This paper analyzes a memory-efficient implicit im2col algorithm used in Google's TPU, demonstrates its effectiveness, and shows that it can also be applied to NVIDIA's Tensor Cores, improving convolution support on GEMM-based accelerators.

Contribution

It introduces a novel implicit im2col algorithm that is both hardware-friendly and scalable, enabling efficient convolution on GEMM-based accelerators such as Google's TPU and NVIDIA's Tensor Cores.

Findings

01 The algorithm is adopted in commercial platforms.

02 It achieves near-zero overhead in converting convolution to GEMM.

03 It outperforms existing methods on NVIDIA's Tensor Cores.

Abstract

Many of today's deep neural network accelerators, e.g., Google's TPU and NVIDIA's tensor core, are built around accelerating the general matrix multiplication (i.e., GEMM). However, supporting convolution on GEMM-based accelerators is not trivial. The naive method explicitly lowers the convolution to GEMM, commonly known as im2col, which introduces significant performance and memory overhead. Existing implicit im2col algorithms require unscalable hardware and are inefficient in supporting important convolution variants such as strided convolution. In this paper, we propose a memory-efficient and hardware-friendly implicit im2col algorithm used by Google's TPU, which dynamically converts a convolution into a GEMM with practically zero performance and memory overhead, fully unleashing the power of GEMM engines. Through comprehensive experimental results, we quantitatively argue that this…
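
To make the lowering described in the abstract concrete, here is a minimal pure-Python sketch of the explicit im2col baseline (the "naive method"), checked against a direct convolution. It is not the implicit algorithm the paper studies. The layout, the helper names, and the dimension symbols (C input channels, H x W input, K filters of size R x S, P x Q output, no padding) are assumptions chosen for illustration.

```python
def conv2d_direct(x, w, stride=1):
    """Reference convolution. x: [C][H][W], w: [K][C][R][S]; no padding."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    return [[[sum(w[k][c][r][s] * x[c][p * stride + r][q * stride + s]
                  for c in range(C) for r in range(R) for s in range(S))
              for q in range(Q)] for p in range(P)] for k in range(K)]


def im2col(x, R, S, stride=1):
    """Explicit lowering: materialize a (C*R*S) x (P*Q) matrix in memory."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    P = (H - R) // stride + 1
    Q = (W - S) // stride + 1
    cols = [[x[c][p * stride + r][q * stride + s]
             for p in range(P) for q in range(Q)]
            for c in range(C) for r in range(R) for s in range(S)]
    return cols, P, Q


def matmul(a, b):
    """Plain GEMM on nested lists: (M x N) times (N x L)."""
    return [[sum(row[n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))] for row in a]


def conv2d_im2col(x, w, stride=1):
    """Convolution as one GEMM: flattened filters times the lowered input."""
    K, C, R, S = len(w), len(w[0]), len(w[0][0]), len(w[0][0][0])
    cols, P, Q = im2col(x, R, S, stride)
    w_mat = [[w[k][c][r][s] for c in range(C) for r in range(R) for s in range(S)]
             for k in range(K)]
    flat = matmul(w_mat, cols)  # K x (P*Q)
    return [[row[p * Q:(p + 1) * Q] for p in range(P)] for row in flat]


# Small integer example: 2-channel 4x4 input, three 3x3 filters.
x = [[[c * 16 + h * 4 + v for v in range(4)] for h in range(4)] for c in range(2)]
w = [[[[(k + c + r + s) % 3 - 1 for s in range(3)] for r in range(3)]
      for c in range(2)] for k in range(3)]

cols, P, Q = im2col(x, 3, 3)
lowered_elems = len(cols) * len(cols[0])  # 18 rows x 4 columns = 72
input_elems = 2 * 4 * 4                   # 32
same = conv2d_im2col(x, w) == conv2d_direct(x, w)
```

Even in this toy case the lowered matrix holds 72 elements against 32 in the original input, because overlapping windows are copied once per filter position. That duplication is the memory overhead an implicit scheme tries to avoid by generating the lowered operands on the fly instead of storing them.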

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Stochastic Gradient Optimization Techniques