Tempus Core: Area-Power Efficient Temporal-Unary Convolution Core for Low-Precision Edge DLAs

Prabhu Vellaisamy; Harideep Nair; Thomas Kang; Yichen Ni; Haoyang Fan; Bin Qi; Jeff Chen; Shawn Blanton; John Paul Shen

arXiv:2412.19002·cs.AR·December 30, 2024

Open Access

TL;DR

Tempus Core introduces a scalable unary-based convolution core that significantly reduces area and power consumption in deep learning accelerators, enabling efficient edge AI inference with minimal hardware overhead.

Contribution

This work presents Tempus Core, a novel unary-based PE array design that integrates seamlessly with existing DLAs, offering substantial improvements in area, power, and throughput for low-precision neural network inference.

Findings

01. Achieves a 59.3% area reduction and a 15.3% power reduction over NVDLA's CMAC unit at INT8 precision in 45 nm CMOS.

02. Delivers 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions, respectively.

03. Requires only 0.017 mm^2 of die area and 6.2 mW of power for a 16x4 PE array at INT4 precision.

Abstract

The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to the resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLAs) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm…
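To make the idea of a temporal-unary-binary multiplier more concrete, here is a minimal behavioral sketch in Python. It is not the paper's RTL and makes several assumptions: one operand is treated as a temporal-unary pulse whose duration in cycles equals its magnitude, the other operand stays in binary and is added to an accumulator on every cycle the pulse is high, and the sign is applied separately. The names `tub_multiply` and `tub_dot` are hypothetical, and which operand (weight or activation) is unary-coded in Tempus Core is not stated in this summary.

```python
from typing import Sequence, Tuple


def tub_multiply(unary_operand: int, binary_operand: int) -> Tuple[int, int]:
    """Behavioral model of one temporal-unary-binary multiply.

    Assumption: the unary operand is streamed as a pulse that stays high
    for |unary_operand| cycles; the binary operand is accumulated once
    per high cycle. Returns (product, cycles_taken).
    """
    sign = -1 if unary_operand < 0 else 1
    magnitude = abs(unary_operand)

    accumulator = 0
    cycles = 0
    for _ in range(magnitude):       # one iteration per cycle the pulse is high
        accumulator += binary_operand
        cycles += 1

    return sign * accumulator, cycles


def tub_dot(unary_values: Sequence[int],
            binary_values: Sequence[int]) -> Tuple[int, int]:
    """Dot product over parallel multipliers (hypothetical arrangement).

    Assumption: every multiplier runs concurrently, so the latency is set
    by the largest unary magnitude; zero and small values finish early.
    """
    if len(unary_values) != len(binary_values):
        raise ValueError("operand vectors must have the same length")

    total = 0
    latency = 0
    for u, b in zip(unary_values, binary_values):
        product, cycles = tub_multiply(u, b)
        total += product
        latency = max(latency, cycles)
    return total, latency


if __name__ == "__main__":
    print(tub_multiply(3, -5))              # (-15, 3)
    print(tub_dot([1, 0, -2], [4, 5, 6]))   # (-8, 2)
```

In this model a zero-valued unary operand costs zero cycles and small magnitudes finish quickly, which illustrates why the abstract ties unary hardware to data sparsity and low-precision values: the multiplier is replaced by an adder, and latency scales with operand magnitude rather than being fixed.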

Taxonomy

Topics: Optical Network Technologies · Photonic and Optical Devices · Neural Networks and Reservoir Computing

Methods: Convolution · Deep Layer Aggregation