Hardware-Aware Reformulation of Convolutions for Efficient Execution on Specialized AI Hardware: A Case Study on NVIDIA Tensor Cores

Ganesh Bikshandi

arXiv:2601.11608 · cs.DC · January 21, 2026

TL;DR

This paper introduces a hardware-aware reformulation method for CNNs that restructures computations to meet hardware constraints like NVIDIA Tensor Cores' input channel requirements, enabling efficient post-training deployment without weight modification.

Contribution

It presents the first hardware-aware reformulation approach for CNNs that satisfies hardware constraints through post-training math restructuring, without altering network weights.

Findings

01

Reformulation achieves hardware alignment without zero-padding.

02

Method improves efficiency of CNN deployment on NVIDIA Tensor Cores.

03

Framework is generalizable to other hardware accelerators.

Abstract

Convolutional Neural Networks (CNNs) are central to modern AI, but their performance is often limited by hardware constraints. NVIDIA Tensor Cores, for instance, require input channels to be multiples of 8, and sometimes 512, for efficient execution. The oneDNN framework for CPUs imposes a similar requirement for its blocked format. Traditional approaches address such alignment issues with zero-padding, which can be inefficient. In this work, we present a first-step, hardware-aware reformulation of CNN computations using rewrite rules, restructuring the underlying math to satisfy hardware alignment entirely post-training, without modifying network weights. While our current implementation focuses on a single transformation for Tensor Cores, the approach is generalizable, laying the foundation for exploring additional transformations for CPUs and accelerators. This study represents an…
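To make the zero-padding baseline mentioned in the abstract concrete, here is a minimal, dependency-free Python sketch. It illustrates only the traditional padding approach, not the paper's rewrite rules, which the abstract does not detail. The helper names `pad_channels` and `conv2d`, the tensor layouts, and the toy sizes are assumptions made for illustration.

```python
def pad_channels(x, w, multiple=8):
    """Zero-pad the channel axis of x[C][H][W] and w[K][C][R][S] (hypothetical helper)."""
    extra = (-len(x)) % multiple                  # channels needed to reach the next multiple
    h, wd = len(x[0]), len(x[0][0])
    r, s = len(w[0][0]), len(w[0][0][0])
    x_p = x + [[[0.0] * wd for _ in range(h)] for _ in range(extra)]
    w_p = [f + [[[0.0] * s for _ in range(r)] for _ in range(extra)] for f in w]
    return x_p, w_p


def conv2d(x, w):
    """Naive stride-1 convolution (cross-correlation) with no spatial padding."""
    c, h, wd = len(x), len(x[0]), len(x[0][0])
    r, s = len(w[0][0]), len(w[0][0][0])
    return [[[sum(x[ci][i + di][j + dj] * f[ci][di][dj]
                  for ci in range(c) for di in range(r) for dj in range(s))
              for j in range(wd - s + 1)]
             for i in range(h - r + 1)]
            for f in w]


# Toy data: 3 input channels (e.g. RGB), 4x4 image, two 3x3 filters.
x = [[[float(c + 4 * i + j) for j in range(4)] for i in range(4)] for c in range(3)]
w = [[[[float(k + c - i + j) for j in range(3)] for i in range(3)]
      for c in range(3)] for k in range(2)]

x_p, w_p = pad_channels(x, w)            # 3 channels padded up to 8
y, y_p = conv2d(x, w), conv2d(x_p, w_p)  # padded channels contribute only zeros
wasted = 1 - len(x) / len(x_p)           # share of multiply-adds spent on zero channels
```

In this toy case the padded and unpadded convolutions produce the same output, but five of the eight channels in the padded tensors hold only zeros, so most of the multiply-accumulate work yields nothing. That overhead is the inefficiency the abstract attributes to zero-padding.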

Taxonomy

Topics: Advanced Neural Network Applications · Tensor decomposition and applications · Adversarial Robustness in Machine Learning