Enabling High-Capacity, Latency-Tolerant, and Highly-Concurrent GPU Register Files via Software/Hardware Cooperation

Mohammad Sadrosadati; Amirhossein Mirhosseini; Ali Hajiabadi; Seyed Borna Ehsani; Hajar Falahati; Hamid Sarbazi-Azad; Mario Drumond; Babak Falsafi; Rachata Ausavarungnirun; Onur Mutlu

arXiv:2010.09330·cs.AR·October 20, 2020

TL;DR

This paper introduces LTRF, a software/hardware cooperative architecture that prefetches register working-sets to reduce latency and enable larger register files in GPUs, significantly improving performance.

Contribution

It proposes a novel LTRF architecture with compile-time analysis and register renumbering to prefetch register data, reducing latency and enabling high-capacity GPU register files.

Findings

1. Enables 8x larger register files with emerging memory technologies.

2. Improves GPU performance by 34% through latency reduction.

3. Reduces register bank conflicts with compile-time renumbering.

Abstract

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs because of their long access latency, high power consumption, and large silicon area. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. However, this approach does not improve register access latency, because the hit rate in the register file cache is low. In this paper, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp's aggregate…
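To make the idea of dividing execution into intervals concrete, here is a minimal Python sketch. It is not the paper's algorithm: it assumes straight-line code with no control flow, models each instruction only as the set of register numbers it touches, and uses a hypothetical greedy rule that cuts a new interval whenever the accumulated register working set would exceed the register file cache capacity. The function name `split_into_intervals` and the `capacity` parameter are illustrative assumptions.

```python
def split_into_intervals(instrs, capacity):
    """Greedily cut a straight-line instruction list into intervals.

    instrs   -- iterable of register-number collections, one per instruction
    capacity -- assumed number of registers the register file cache can hold

    Returns a list of (instructions, working_set) pairs, where working_set
    is the sorted list of registers that would be prefetched for that interval.
    """
    intervals = []
    current, working_set = [], set()
    for regs in instrs:
        regs = set(regs)
        if len(regs) > capacity:
            raise ValueError("a single instruction exceeds the cache capacity")
        # Start a new interval if adding this instruction would overflow the cache.
        if current and len(working_set | regs) > capacity:
            intervals.append((current, sorted(working_set)))
            current, working_set = [], set()
        current.append(sorted(regs))
        working_set |= regs
    if current:
        intervals.append((current, sorted(working_set)))
    return intervals


# Four instructions, each listing the registers it reads or writes.
program = [(0, 1, 2), (2, 3, 4), (5, 6, 0), (7, 8, 5)]
intervals = split_into_intervals(program, capacity=5)
for body, working_set in intervals:
    print(len(body), "instructions, prefetch registers", working_set)
```

With a capacity of five registers, the toy program splits into two intervals of two instructions each. In this simplified model, every register an interval needs is known before the interval starts, which is what would let a prefetch of the whole working set replace per-access cache misses.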

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Interconnection Networks and Systems