GPU Domain Specialization via Composable On-Package Architecture

Yaosheng Fu; Evgeny Bolotin; Niladrish Chatterjee; David Nellans; Stephen W. Keckler

arXiv:2104.02188·cs.AR·April 7, 2021

Open Access

TL;DR

This paper proposes a composable GPU architecture, COPA-GPU, that specializes memory and compute resources for deep learning and HPC workloads, improving performance and efficiency over traditional converged GPU designs.

Contribution

Introduction of COPA-GPU, a modular, domain-specific GPU architecture leveraging multi-chip-module disaggregation for optimized deep learning and HPC performance.

Findings

01

A DL-optimized COPA-GPU achieves 31% higher training performance.

02

It increases inference performance by 35%.

03

It reduces the number of GPU instances needed in scale-out training by 50%.

Abstract

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address the diverging architectural requirements of FP32 (or larger) based HPC and FP16 (or smaller) based DL workloads results in a sub-optimal configuration for either application domain. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture that provides domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4x higher off-die bandwidth,…

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing