Hardware-Aware Reformulation of Convolutions for Efficient Execution on Specialized AI Hardware: A Case Study on NVIDIA Tensor Cores
Ganesh Bikshandi

TL;DR
This paper introduces a hardware-aware reformulation method for CNNs that restructures computations to meet hardware constraints like NVIDIA Tensor Cores' input channel requirements, enabling efficient post-training deployment without weight modification.
Contribution
It presents a first-step hardware-aware reformulation approach for CNNs that satisfies hardware constraints by restructuring the underlying math post-training, without altering network weights.
Findings
The reformulation achieves hardware alignment without zero-padding.
The method targets more efficient CNN deployment on NVIDIA Tensor Cores.
The approach is generalizable to CPUs and other hardware accelerators.
Abstract
Convolutional Neural Networks (CNNs) are central to modern AI, but their performance is often limited by hardware constraints. NVIDIA Tensor Cores, for instance, require input channels to be multiples of 8, and sometimes of 512, for efficient execution. The oneDNN framework for CPUs imposes a similar requirement for its blocked format. Traditional approaches address such alignment issues using zero-padding, which can be inefficient. In this work, we present a first-step, hardware-aware reformulation of CNN computations using rewrite rules, restructuring the underlying math to satisfy hardware alignment entirely post-training, without modifying network weights. While our current implementation focuses on a single transformation for Tensor Cores, this approach is generalizable, laying the foundation to explore additional transformations for CPUs and accelerators. This study represents an…
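To make the alignment issue concrete, the sketch below illustrates the traditional zero-padding baseline that the abstract contrasts against. It is not the paper's reformulation, whose rewrite rules are not given in this summary. It assumes a 3-channel input (for example, an RGB first layer) and an alignment of 8. All function names (`padded_channels`, `conv2d`, `pad_input_channels`) are hypothetical helpers written for illustration in pure Python.

```python
import random

ALIGN = 8  # channel multiple the abstract cites for Tensor Cores


def padded_channels(c, align=ALIGN):
    """Round a channel count up to the next multiple of `align`."""
    return ((c + align - 1) // align) * align


def conv2d(x, w):
    """Direct 'valid' cross-correlation with stride 1.

    x: [C][H][W] input, w: [K][C][R][S] filters -> [K][H-R+1][W-S+1].
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    K, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    out = [[[0.0] * (W - S + 1) for _ in range(H - R + 1)] for _ in range(K)]
    for k in range(K):
        for i in range(H - R + 1):
            for j in range(W - S + 1):
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            acc += x[c][i + r][j + s] * w[k][c][r][s]
                out[k][i][j] = acc
    return out


def pad_input_channels(x, w, align=ALIGN):
    """Append all-zero channels to the input and to every filter."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    R, S = len(w[0][0]), len(w[0][0][0])
    extra = padded_channels(C, align) - C
    x_pad = x + [[[0.0] * W for _ in range(H)] for _ in range(extra)]
    w_pad = [f + [[[0.0] * S for _ in range(R)] for _ in range(extra)] for f in w]
    return x_pad, w_pad


# Assumed toy shapes: 3 input channels, 5x5 image, 2 filters of size 3x3.
random.seed(0)
C, H, W, K, R, S = 3, 5, 5, 2, 3, 3
x = [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(C)]
w = [[[[random.random() for _ in range(S)] for _ in range(R)]
      for _ in range(C)] for _ in range(K)]

x_pad, w_pad = pad_input_channels(x, w)
y = conv2d(x, w)
y_pad = conv2d(x_pad, w_pad)

# Largest element-wise difference between the two outputs.
max_diff = max(
    abs(y[k][i][j] - y_pad[k][i][j])
    for k in range(K)
    for i in range(H - R + 1)
    for j in range(W - S + 1)
)
# Fraction of padded input channels that carry only zeros.
wasted = 1 - C / len(x_pad)
print(len(x_pad), max_diff, wasted)
```

Padding leaves the output unchanged because the extra channels contribute only zero products, but in this toy case five of the eight channels do no useful work, and the stored filters grow accordingly. That overhead is the inefficiency a padding-free reformulation aims to avoid.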
Taxonomy
Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Adversarial Robustness in Machine Learning
