TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs
Diyou Shen, Yichao Zhang, Marco Bertuletti, Luca Benini

TL;DR
The paper introduces TCDM Burst Access, a software-transparent hardware mechanism that improves bandwidth utilization in large-scale shared-L1 RVV clusters, yielding up to 2.76x performance and 1.9x energy-efficiency gains for deep learning workloads.
Contribution
The paper presents a software-transparent burst transaction architecture that improves bandwidth utilization in many-core vector clusters with shared, multi-banked L1 memory, and reports gains across several cluster core counts.
Findings
Improves bandwidth by up to 226% in large clusters
Achieves up to 80% of the peak core-to-memory bandwidth
Delivers up to 2.76x performance and 1.9x energy-efficiency gains
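The findings mix two notations: a percentage improvement (226%) and multiplicative factors (2.76x, 1.9x). The short sketch below converts between them, assuming "improved by X%" means a relative increase over the baseline; the helper names are illustrative and not taken from the paper.

```python
def improvement_to_factor(percent_improvement: float) -> float:
    """Convert 'improved by X%' into a multiplicative factor over the baseline.

    Assumption: the percentage is a relative increase, so +100% means 2x.
    """
    return 1.0 + percent_improvement / 100.0


def factor_to_improvement(factor: float) -> float:
    """Convert an 'N x' factor into a relative improvement in percent."""
    return (factor - 1.0) * 100.0


# Figures quoted in the findings list.
bandwidth_factor = improvement_to_factor(226.0)   # about 3.26x the baseline
performance_gain = factor_to_improvement(2.76)    # about +176%
energy_gain = factor_to_improvement(1.9)          # about +90%

print(f"bandwidth: {bandwidth_factor:.2f}x baseline")
print(f"performance: +{performance_gain:.0f}%")
print(f"energy efficiency: +{energy_gain:.0f}%")
```

Under that reading, a 226% bandwidth improvement corresponds to roughly 3.26 times the baseline bandwidth.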
Abstract
As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, a software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst…
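To illustrate why vector accesses stress a hierarchical interconnect, here is a toy model of a word-interleaved, multi-banked L1. It is a sketch under stated assumptions, not the paper's design: the word size, bank and tile counts, the address-to-bank mapping, and the rule that one burst covers a run of consecutive words in the same tile are all hypothetical. The abstract is truncated before it describes the Burst Manager, so that component is not modeled here.

```python
from itertools import groupby

# All parameters below are hypothetical, chosen only for illustration.
WORD_BYTES = 4        # assumed 32-bit words
BANKS_PER_TILE = 4    # assumed banks grouped into one tile
NUM_TILES = 4         # assumed number of tiles in the cluster
NUM_BANKS = BANKS_PER_TILE * NUM_TILES


def bank_of(addr: int) -> int:
    """Word-interleaved mapping: consecutive words go to consecutive banks."""
    return (addr // WORD_BYTES) % NUM_BANKS


def tile_of(addr: int) -> int:
    """Tile that holds the bank this address maps to."""
    return bank_of(addr) // BANKS_PER_TILE


def unit_stride_addresses(base: int, num_words: int) -> list[int]:
    """Byte addresses touched by a unit-stride vector load of num_words words."""
    return [base + i * WORD_BYTES for i in range(num_words)]


def narrow_requests(addrs: list[int]) -> int:
    """Without bursts, every word is a separate request through the network."""
    return len(addrs)


def burst_requests(addrs: list[int]) -> int:
    """Toy burst rule: one request per run of consecutive words in one tile."""
    return sum(1 for _ in groupby(addrs, key=tile_of))


addrs = unit_stride_addresses(base=0, num_words=16)
print("narrow requests:", narrow_requests(addrs))
print("burst requests: ", burst_requests(addrs))
```

In this toy setup, a 16-word unit-stride load issues 16 narrow requests but only 4 tile-level bursts, which is the kind of reduction in interconnect traffic that burst support targets.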
Taxonomy
Topics: Power Line Communications and Noise
