Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference
Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina

TL;DR
This paper introduces the Systolic Tensor Array (STA), a hardware architecture for INT8 CNN inference on mobile devices. It replaces the scalar processing elements of a conventional systolic array with tensor processing elements and adds support for a block-sparse data format, reducing circuit area and power at the same throughput.
Contribution
The paper generalizes the scalar processing element of the traditional systolic array into a tensor processing element, and adds support for a novel block-sparse format. Both changes improve area and power efficiency.
Findings
STA reduces circuit area by up to 2.08x and power by up to 1.36x compared to the conventional SA, at iso-throughput with INT8 operands.
STA-DBB achieves up to 3.14x area and 1.97x power improvements over the SA baseline.
The design supports both dense and sparse models efficiently.
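To make the block-sparse idea concrete, here is a minimal Python sketch. It assumes a layout in which each fixed-size block of weights holds at most a bounded number of nonzeros. The summary does not define the format, so the block size (8), the bound (4), and the function names `block_sparse_compress` and `block_sparse_dot` are hypothetical and are not taken from the paper.

```python
def block_sparse_compress(weights, block=8, max_nnz=4):
    """Split weights into fixed-size blocks, keeping (index, value) pairs
    for the nonzeros. block and max_nnz are illustrative assumptions."""
    blocks = []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        nonzeros = [(i, w) for i, w in enumerate(chunk) if w != 0]
        if len(nonzeros) > max_nnz:
            raise ValueError(f"block at {start} exceeds {max_nnz} nonzeros")
        blocks.append(nonzeros)
    return blocks


def block_sparse_dot(blocks, activations, block=8):
    """Dot product that multiplies only the stored nonzero weights."""
    total = 0
    for b, nonzeros in enumerate(blocks):
        base = b * block
        for i, w in nonzeros:
            total += w * activations[base + i]
    return total


weights = [0, 3, 0, 0, -2, 0, 0, 1,
           5, 0, 0, 0, 0, 0, 0, 0]
activations = list(range(16))
compressed = block_sparse_compress(weights)
dense = sum(w * a for w, a in zip(weights, activations))
sparse = block_sparse_dot(compressed, activations)
```

Because the number of nonzeros per block is bounded, the work per block is bounded too, which is the property that lets hardware provision a fixed number of multipliers per block. A dense model can still run on such a scheme if the hardware also offers a dense mode, but this sketch does not model that.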
Abstract
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs) with very efficient local data movement; it is well suited to accelerating GEMM and widely deployed in industry. In this work, we describe two significant improvements to the traditional SA architecture that specifically optimize it for CNN inference. Firstly, we generalize the traditional scalar PE into a Tensor-PE, which gives rise to a family of new Systolic Tensor Array (STA) microarchitectures. The STA family increases intra-PE operand reuse and datapath efficiency, reducing circuit area and power dissipation by as much as 2.08x and 1.36x respectively, compared to the conventional SA at iso-throughput with INT8 operands. Secondly, we…
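The Tensor-PE idea in the abstract can be illustrated functionally with a short Python sketch. It models only the arithmetic: a GEMM is split into small tiles, and each hypothetical `tensor_pe` call multiplies one tile pair and accumulates the result. The tile sizes (2x2x2) and the function names are assumptions for illustration; the sketch does not model the pipelining, timing, or dataflow of the actual STA hardware.

```python
def tensor_pe(a_tile, b_tile, acc):
    """One hypothetical tensor-PE step: acc += a_tile x b_tile.
    Each a_tile element is reused across every column of b_tile."""
    for i, a_row in enumerate(a_tile):
        for j in range(len(b_tile[0])):
            acc[i][j] += sum(a_row[k] * b_tile[k][j]
                             for k in range(len(b_tile)))
    return acc


def tiled_gemm(a, b, tm=2, tk=2, tn=2):
    """Compute a x b by dispatching (tm x tk) and (tk x tn) tiles to tensor_pe.
    Tile sizes are illustrative and must divide the matrix dimensions."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a)
    assert m % tm == 0 and k % tk == 0 and n % tn == 0
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tm):
        for j0 in range(0, n, tn):
            acc = [[0] * tn for _ in range(tm)]  # wide accumulators
            for k0 in range(0, k, tk):
                a_tile = [row[k0:k0 + tk] for row in a[i0:i0 + tm]]
                b_tile = [row[j0:j0 + tn] for row in b[k0:k0 + tk]]
                tensor_pe(a_tile, b_tile, acc)
            for i in range(tm):
                c[i0 + i][j0:j0 + tn] = acc[i]
    return c


# INT8-range operands (values within -128..127)
a = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
b = [[1, 0],
     [0, 1],
     [1, 0],
     [0, 1]]
c = tiled_gemm(a, b)
```

In a scalar PE, each operand fetched feeds a single multiply-accumulate. In this tiled form, every element of `a_tile` is used `tn` times and every element of `b_tile` is used `tm` times within one step, which is one way to read the abstract's "intra-PE operand reuse".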
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Advanced Memory and Neural Computing
