Benchmarking and Dissecting the Nvidia Hopper GPU Architecture
Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu

TL;DR
This paper benchmarks and analyzes Nvidia's Hopper GPU architecture in depth, revealing its microarchitectural features, new instructions, and AI hardware units to aid software optimization.
Contribution
The first detailed microbenchmarking study of the Nvidia Hopper GPU, focusing on its novel features, new instructions, and tensor core performance.
Findings
Hopper GPUs introduce new tensor cores with FP8 support, along with DPX instructions and distributed shared memory.
Benchmarking reveals the performance characteristics of Hopper's distributed shared memory and new instructions.
These insights into Hopper's AI hardware units can guide software optimization.
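FP8 is the headline new data type for Hopper's tensor cores, which support two 8-bit layouts: E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). To make the range and precision trade-off concrete, here is a minimal Python sketch that decodes one FP8 byte in either layout, assuming the bit conventions of the published FP8 formats proposal (E4M3: bias 7, no infinities, only S.1111.111 is NaN; E5M2: bias 15, IEEE-style Inf/NaN). The function name `decode_fp8` is hypothetical. This is a software model for illustration only, not how the hardware or any CUDA API performs the conversion.

```python
import math


def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 byte (E4M3 or E5M2 layout) to a Python float.

    Hypothetical helper; assumes the E4M3/E5M2 conventions of the FP8
    formats proposal, not any specific library's conversion routine.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    if fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format {fmt!r}")

    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1

    if fmt == "e4m3":
        # E4M3 has no infinities; only the all-ones pattern S.1111.111 is NaN,
        # which frees the top exponent for normal values (max 448).
        if exp == exp_max and man == (1 << man_bits) - 1:
            return math.nan
    elif exp == exp_max:
        # E5M2 follows IEEE 754: all-ones exponent means Inf (mantissa 0) or NaN.
        return sign * math.inf if man == 0 else math.nan

    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)
```

For example, `decode_fp8(0x7E)` gives the largest finite E4M3 value and `decode_fp8(0x7B, "e5m2")` the largest finite E5M2 value. E4M3 trades range for an extra mantissa bit, while E5M2 keeps a wider dynamic range.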
Abstract
Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A substantial body of studies has been dedicated to dissecting the microarchitectural metrics characterizing diverse GPU generations, which helps researchers understand the hardware details and leverage them to optimize GPU programs. However, the latest Hopper GPUs present a set of novel attributes, including FP8 support in the new tensor cores, DPX instructions, and distributed shared memory, whose performance and operational characteristics remain largely unexplored. In this research, we propose an extensive benchmarking study focused on the Hopper GPU. The objective is to unveil its microarchitectural intricacies through an examination of the new…
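DPX instructions target dynamic-programming kernels, whose inner loops are dominated by add-then-max (or min) updates, sometimes clamped at zero. As an illustration of that pattern, here is a minimal Python sketch of Smith-Waterman local-alignment scoring with a linear gap penalty. The function name and scoring parameters are illustrative assumptions. This is a CPU reference model, not the paper's benchmark and not a CUDA DPX intrinsic. The comment marks the per-cell add/max/clamp step that DPX is designed to execute in fewer instructions on Hopper.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Best local-alignment score between a and b (linear gap penalty).

    Hypothetical reference model; parameters are illustrative defaults.
    """
    prev = [0] * (len(b) + 1)  # DP row for a[:i-1]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)  # DP row for a[:i]; column 0 stays 0
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell is a max over several additions, clamped at zero.
            # This add + max (+ clamp) pattern is the kind of work DPX
            # instructions fuse in hardware.
            cur[j] = max(prev[j - 1] + sub,  # diagonal: match/mismatch
                         prev[j] + gap,      # gap in b
                         cur[j - 1] + gap,   # gap in a
                         0)                  # local alignment restarts at 0
            best = max(best, cur[j])
        prev = cur
    return best
```

Keeping only two rows of the DP table bounds memory at O(len(b)). On a GPU, the same recurrence is usually parallelized along anti-diagonals, where every cell performs this fused update.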
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
