Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators
Md Rownak Hossain Chowdhury, Mostafizur Rahman

TL;DR
This paper introduces a hardware-centric framework to efficiently implement 7-D convolution loops in reconfigurable AI accelerators, reducing control overhead and improving data reuse for high-performance neural network inference.
Contribution
It reinterprets the 7-D convolution loop nest as a data and instruction streaming problem, enabling flexible, lightweight deployment on reconfigurable hardware without heavy transformations.
Findings
Over 90% PE utilization on the MAVeC accelerator
Achieved 1.56 TFLOPs/sec throughput for VGG-16
Supported full VGG-16 inference with scalable performance
Abstract
Convolution remains the most compute-intensive operation in AI acceleration, often constituting over 80-90% of the workload. Existing approaches in spatial architectures such as coarse-grained reconfigurable arrays (CGRAs) and field-programmable gate arrays (FPGAs) frequently rely on loop unrolling or GEMM-based matrix transformations, introducing significant overhead in both data movement and instruction control. This paper presents a new framework designed to systematically demystify the 7-dimensional convolution loop nest by reinterpreting it as a hardware-centric data and instruction streaming problem. Instead of treating the loop nest as a fixed computational construct, our approach exposes its structure as a set of spatial and temporal mappings governed by hardware parameters such as compute element distribution, interconnect topology, and reconfigurability. This abstraction…
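For readers unfamiliar with the term, the 7-D convolution loop nest mentioned in the abstract can be sketched as seven nested loops over batch, output channels, output rows, output columns, input channels, filter rows, and filter columns. The Python below is a minimal reference sketch, not the paper's streaming formulation or the MAVeC mapping. The function name `conv7d`, the index letters (n, m, p, q, c, r, s), the NCHW layout, and the absence of padding are all assumptions made for illustration.

```python
def conv7d(x, w, stride=1):
    """Naive direct convolution written as an explicit 7-deep loop nest.

    x: input activations as nested lists, shape [N][C][H][W] (assumed layout)
    w: filter weights as nested lists, shape [M][C][R][S] (assumed layout)
    Returns the output as nested lists, shape [N][M][P][Q]. No padding.
    """
    N, C, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    M, R, S = len(w), len(w[0][0]), len(w[0][0][0])
    P = (H - R) // stride + 1  # output height
    Q = (W - S) // stride + 1  # output width

    out = [[[[0.0] * Q for _ in range(P)] for _ in range(M)] for _ in range(N)]

    for n in range(N):                      # 1: batch
        for m in range(M):                  # 2: output channel
            for p in range(P):              # 3: output row
                for q in range(Q):          # 4: output column
                    acc = 0.0
                    for c in range(C):      # 5: input channel
                        for r in range(R):  # 6: filter row
                            for s in range(S):  # 7: filter column
                                acc += (x[n][c][p * stride + r][q * stride + s]
                                        * w[m][c][r][s])
                    out[n][m][p][q] = acc
    return out


# Usage: one 3x3 single-channel image convolved with one 2x2 all-ones filter.
image = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
kernel = [[[[1, 1], [1, 1]]]]
result = conv7d(image, kernel)
```

In this sketch the four outer loops select an output element and the three inner loops accumulate into it. Which loops are unrolled across compute elements (spatial) and which are iterated over time (temporal) is the kind of mapping choice the abstract refers to; the ordering shown here is just one arbitrary choice.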
