A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators
Luca Colagrande, Lorenzo Leone, Chen Wu, Tim Fischer, Raphael Roth, Luca Benini

TL;DR
This paper introduces a lightweight, high-throughput Network on Chip (NoC) with hardware support for collective operations (barrier synchronization, multicast, and reduction), targeting large-scale ML accelerators. It reports speedups on multicast and reduction operations along with improved energy efficiency.
Contribution
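The paper presents a novel NoC design with Direct Compute Access (DCA), which gives the interconnect fabric direct access to the cores' computational resources. This enables high-bandwidth in-network reductions for scalable ML workloads at a 16.9% router area overhead.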
Findings
Achieves 5.3x speedup in multicast operations
Achieves 2.8x speedup in reduction operations
Scales efficiently to large mesh architectures with significant performance gains
Abstract
The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions with a small 16.9% router area overhead. Through…
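To make the motivation for in-network collectives concrete, the following is a toy sketch and not the paper's design. It assumes an n x n mesh and counts only link traversals, ignoring timing, contention, and DCA itself. All function names are hypothetical, and the counts are not meant to reproduce the speedups listed under Findings.

```python
def unicast_link_traversals(n, src=(0, 0)):
    """Links crossed when src sends a separate copy to every node.

    Assumes minimal (Manhattan-distance) routing on an n x n mesh.
    """
    sx, sy = src
    return sum(abs(x - sx) + abs(y - sy) for x in range(n) for y in range(n))


def multicast_tree_link_traversals(n):
    """Links crossed when routers replicate the payload along a spanning tree.

    A spanning tree over n*n nodes has n*n - 1 edges, each used once.
    """
    return n * n - 1


def in_network_reduce(grid):
    """Sum all values by combining partial sums hop by hop.

    Each row is reduced toward its first column, then the first column is
    reduced toward the top-left node. Returns (total, link traversals).
    """
    hops = 0
    row_sums = []
    for row in grid:
        acc = row[-1]
        for value in reversed(row[:-1]):
            acc += value  # partial sum moves one hop and is combined locally
            hops += 1
        row_sums.append(acc)
    total = row_sums[-1]
    for partial in reversed(row_sums[:-1]):
        total += partial  # column-wise combination toward the top-left node
        hops += 1
    return total, hops


if __name__ == "__main__":
    n = 4
    print("unicast copies:", unicast_link_traversals(n))
    print("tree multicast:", multicast_tree_link_traversals(n))
    print("in-network reduce:", in_network_reduce([[1] * n for _ in range(n)]))
```

In this model, replicating or combining data inside the network touches each tree link once, whereas sending separate copies from a single source repeatedly crosses the links near that source.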
