tubGEMM: Energy-Efficient and Sparsity-Effective Temporal-Unary-Binary Based Matrix Multiply Unit

Prabhu Vellaisamy, Harideep Nair, Joseph Finn, Manav Trivedi, Albert Chen, Anna Li, Tsung-Han Lin, Perry Wang, Shawn Blanton, and John Paul Shen

arXiv:2412.17955 · cs.AR · December 25, 2024

TL;DR

The paper introduces tubGEMM, a novel energy-efficient matrix multiply unit using hybrid encoding that performs exact computations and exploits sparsity, significantly reducing hardware costs and energy consumption for deep learning applications.

Contribution

It presents tubGEMM, a new matrix multiply unit design employing hybrid temporal-unary and binary encoding for exact GEMM, improving energy efficiency and exploiting sparsity.

Findings

01

Reduces area, power, and energy by 89%, 87%, and 50%, respectively, compared to uGEMM, the current best unary design.

02

Occupies 0.22 mm^2 and consumes 417.72 mW and 8.86 uJ for a 128x128 matrix multiply on 8-bit integers.

03

Sparsity in DL workloads reduces energy by over 3x; lower precision further reduces energy consumption.

Abstract

General Matrix Multiplication (GEMM) is a ubiquitous compute kernel in deep learning (DL). To support energy-efficient edge-native processing, new GEMM hardware units have been proposed that operate on unary encoded bitstreams using much simpler hardware. Most unary approaches thus far focus on rate-based unary encoding of values and perform stochastic approximate computation. This work presents tubGEMM, a novel matrix-multiply unit design that employs hybrid temporal-unary and binary (tub) encoding and performs exact (not approximate) GEMM. It intrinsically exploits dynamic value sparsity to improve energy efficiency. Compared to the current best unary design uGEMM, tubGEMM significantly reduces area, power, and energy by 89%, 87%, and 50%, respectively. A tubGEMM design performing 128x128 matrix multiply on 8-bit integers, in commercial TSMC N5 (5nm) process node, consumes just…
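
The abstract does not spell out the datapath, so the following is only a behavioral sketch of one plausible reading of hybrid temporal-unary and binary encoding, not the authors' design. It assumes one operand is presented as a pulse lasting |a| cycles (temporal-unary) while the other stays in ordinary binary, so a product is formed by adding the binary operand once per pulse cycle. The names `tub_multiply`, `tub_gemm`, and `active_cycles` are hypothetical, and the cycle count is used only as a rough stand-in for switching activity, not as the paper's energy model.

```python
def tub_multiply(unary_val, binary_val):
    """Exact product via repeated accumulation (assumed scheme, not from the paper).

    unary_val is treated as a temporal-unary pulse that is high for
    abs(unary_val) cycles; binary_val is added once per high cycle.
    Returns (product, cycles_spent).
    """
    sign = -1 if unary_val < 0 else 1
    acc = 0
    cycles = 0
    for _ in range(abs(unary_val)):
        acc += sign * binary_val
        cycles += 1
    return acc, cycles


def tub_gemm(A, B):
    """Multiply A (M x K, unary operand) by B (K x N, binary operand).

    Returns (C, active_cycles), where active_cycles sums the pulse
    lengths over all multiplies. A zero in A costs zero cycles.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    active_cycles = 0
    for i in range(M):
        for k in range(K):
            for j in range(N):
                prod, cycles = tub_multiply(A[i][k], B[k][j])
                C[i][j] += prod
                active_cycles += cycles
    return C, active_cycles


if __name__ == "__main__":
    A = [[2, 0], [-3, 1]]
    B = [[4, -1], [5, 6]]
    C, cycles = tub_gemm(A, B)
    print(C, cycles)
```

Under these assumptions the result is exact rather than stochastic, because every pulse cycle contributes a full binary addition, and a zero-valued unary operand produces no pulse and therefore no accumulation activity, which is one way dynamic value sparsity could translate into saved work.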
