DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator
Nandan Kumar Jha, Shreyas Ravishankar, Sparsh Mittal, Arvind Kaushik, Dipan Mandal, Mahesh Chandra

TL;DR
DRACO is an algorithm-level co-optimization method that enhances PE utilization and energy efficiency for memory-bound DNNs on systolic accelerators without hardware modifications, also improving predictive performance.
Contribution
It introduces DRACO, a novel algorithm-level co-optimization approach that addresses PE underutilization and enhances DNN predictive accuracy on systolic arrays.
Findings
41.8% improvement in PE utilization
42.6% reduction in inference latency
Negligible loss in predictive performance
Abstract
The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; whereas, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this, specialized dataflow and/or micro-architectural techniques have been proposed. However, due to the longer development cycle and the rapid pace of evolution in the deep learning fields, these hardware-based solutions can be obsolete and ineffective in dealing with PE underutilization for state-of-the-art DNNs. In this work, we address the challenge of PE underutilization at the algorithm front and propose data reuse aware co-optimization (DRACO). This improves the PE utilization of memory-bound DNNs without any additional need for dataflow/micro-architecture modifications. Furthermore, unlike the previous co-optimization…
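To make the compute-bound versus memory-bound distinction in the abstract concrete, the sketch below estimates a simple data-reuse proxy (MACs per element moved) for three convolution types. This is an illustrative assumption, not the metric or method defined in the paper: the function names, the layer shape, and the choice to count each weight and activation exactly once are all hypothetical simplifications.

```python
def conv_macs_and_data(h, w, c_in, c_out, k, groups=1):
    """Return (MACs, elements moved) for a stride-1, same-padded convolution.

    Assumption: every weight, input activation, and output activation
    is moved exactly once, so the result is an idealized lower bound
    on data movement.
    """
    macs = h * w * c_out * (c_in // groups) * k * k
    weights = c_out * (c_in // groups) * k * k
    activations = h * w * c_in + h * w * c_out
    return macs, weights + activations


def reuse(h, w, c_in, c_out, k, groups=1):
    """MACs per element moved: a rough proxy for data reuse."""
    macs, data = conv_macs_and_data(h, w, c_in, c_out, k, groups)
    return macs / data


# Hypothetical layer shape: 56x56 feature map, 128 input and output channels.
standard = reuse(56, 56, 128, 128, k=3)               # dense 3x3 convolution
pointwise = reuse(56, 56, 128, 128, k=1)              # 1x1 convolution
depthwise = reuse(56, 56, 128, 128, k=3, groups=128)  # one filter per channel

for name, value in [("standard", standard),
                    ("pointwise", pointwise),
                    ("depthwise", depthwise)]:
    print(f"{name:>9}: {value:.2f} MACs per element moved")
```

Under these assumptions the depthwise layer performs far fewer MACs per element moved than the dense 3x3 layer, which is the kind of low-reuse workload that tends to leave PEs in a fixed-size systolic array idle while they wait on memory.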
Taxonomy
Methods: Pointwise Convolution · Depthwise Convolution · Softmax · Batch Normalization · Average Pooling · Depthwise Separable Convolution · 1x1 Convolution · Dense Connections · Global Average Pooling
