Dissecting the NVidia Turing T4 GPU via Microbenchmarking

Zhe Jia; Marco Maggioni; Jeffrey Smith; Daniele Paolo Scarpazza

arXiv:1903.07486 · cs.DC · March 19, 2019 · 73 citations

TL;DR

This paper microbenchmarks the NVidia Turing T4 GPU in detail, documenting its architectural features, instruction set, and memory hierarchy, and its performance improvements over previous generations, to aid software optimization.

Contribution

It offers the first comprehensive microarchitectural dissection of the Turing T4 GPU, highlighting new instructions, memory hierarchy details, and performance characteristics compared to prior NVidia GPUs.

Findings

01. Turing introduces new instructions for matrix math.

02. The T4 GPU has larger caches than the Pascal-based P4.

03. Benchmarks show substantial performance improvements over the P4.

Abstract

In 2019, the rapid rate at which GPU manufacturers refresh their designs, coupled with their reluctance to disclose microarchitectural details, is still a hurdle for those software designers who want to extract the highest possible performance. Last year, these very reasons motivated us to dissect the Volta GPU architecture using microbenchmarks. The introduction in August 2018 of Turing, NVidia's latest architecture, pressed us to update our study. In this report, we examine Turing and compare it quantitatively against previous NVidia GPU generations. Specifically, we study the T4 GPU: a low-power board aiming at inference applications. We describe its improvements against its inference-oriented predecessor: the P4 GPU based on the Pascal architecture. Both T4 and P4 GPUs achieve significantly higher frequency-per-Watt figures than their full-size counterparts. We study the…

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Advanced Memory and Neural Computing · Parallel Computing and Optimization Techniques