Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency

Hansung Kim; Ruohan Richard Yan; Joshua You; Tieliang Vamber Yang; Yakun Sophia Shao

arXiv:2408.12073·cs.AR·March 4, 2025

Open Access

TL;DR

Virgo is a GPU microarchitecture that disaggregates matrix units from SIMT cores and integrates them at the core cluster level. This improves scalability and energy efficiency for deep learning workloads by increasing operation granularity and reducing the energy spent on core instruction processing.

Contribution

The paper proposes Virgo, a GPU design that decouples matrix units from SIMT cores, enabling scalable, energy-efficient matrix operations at the cluster level.

Findings

01

Achieves a 67.3% reduction in on-chip active power compared to Ampere-style cores.

02

Achieves a 24.2% reduction in on-chip active power compared to Hopper-style cores.

03

Supports efficient concurrent execution of matrix units and SIMT cores.

Abstract

Modern GPUs incorporate specialized matrix units such as Tensor Cores to accelerate GEMM operations, which are central to deep learning workloads. However, existing matrix unit designs are tightly coupled to the SIMT core, restricting operation size due to register file capacity and bandwidth constraints. Such a limitation in scalability makes it difficult to simultaneously improve compute throughput and energy efficiency in GPUs. To address this challenge, we propose Virgo, a GPU microarchitecture that integrates dedicated matrix units at the SIMT core cluster level. By decoupling the matrix unit from the SIMT core, Virgo eliminates scalability constraints imposed by the core microarchitecture. Consequently, Virgo increases operation granularity at the hardware level, reducing energy overhead from core instruction processing. Physical disaggregation also enables a unified matrix unit…
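
To make the idea of "operation granularity" concrete, the following is a minimal sketch that counts how many tile-level matrix operations are needed to cover a GEMM for two tile shapes. The tile shapes, the problem size, and the function name `matrix_op_count` are illustrative assumptions and are not taken from the paper; the sketch only counts operations and does not model energy or performance.

```python
from math import ceil


def matrix_op_count(M, N, K, tile_m, tile_n, tile_k):
    """Count the tile-level operations needed to cover an M x N x K GEMM.

    Each operation is assumed to compute a (tile_m x tile_k) by
    (tile_k x tile_n) product; partial tiles are rounded up.
    """
    return ceil(M / tile_m) * ceil(N / tile_n) * ceil(K / tile_k)


# Hypothetical tile shapes, chosen only for illustration (not from the paper).
small_tile = (16, 16, 16)    # stands in for a small, register-file-limited operation
large_tile = (128, 128, 64)  # stands in for a larger, cluster-level operation

M = N = K = 1024  # hypothetical GEMM size
small_ops = matrix_op_count(M, N, K, *small_tile)
large_ops = matrix_op_count(M, N, K, *large_tile)

print(f"small-tile operations: {small_ops}")
print(f"large-tile operations: {large_ops}")
print(f"ratio: {small_ops / large_ops:.0f}x")
```

Under these assumed shapes, the larger tile needs far fewer operations to cover the same GEMM, which is the sense in which coarser granularity leaves fewer instructions for the core to issue and process.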

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Interconnection Networks and Systems