In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi; Cliff Young; Nishant Patil; David Patterson; Gaurav Agrawal; Raminder Bajwa; Sarah Bates; Suresh Bhatia; Nan Boden; Al Borchers; Rick Boyle; Pierre-luc Cantin; Clifford Chao; Chris Clark; Jeremy Coriell; Mike Daley; Matt Dau; Jeffrey Dean; Ben Gelb; Tara Vazir Ghaemmaghami; Rajendra Gottipati; William Gulland; Robert Hagmann; C. Richard Ho; Doug Hogberg; John Hu; Robert Hundt; Dan Hurt; Julian Ibarz; Aaron Jaffey; Alek Jaworski; Alexander Kaplan; Harshit Khaitan; Andy Koch; Naveen Kumar; Steve Lacy; James Laudon; James Law; Diemthu Le; Chris Leary; Zhuyuan Liu; Kyle Lucke; Alan Lundin; Gordon MacKean; Adriana Maggiore; Maire Mahony; Kieran Miller; Rahul Nagarajan; Ravi Narayanaswami; Ray Ni; Kathy Nix; Thomas Norrie; Mark Omernick; Narayana Penukonda; Andy Phelps; Jonathan Ross; Matt Ross; Amir Salek; Emad Samadiani; Chris Severn; Gregory Sizikov; Matthew Snelham; Jed Souter; Dan Steinberg; Andy Swing; Mercedes Tan; Gregory Thorson; Bo Tian; Horia Toma; Erick Tuttle; Vijay Vasudevan; Richard Walter; Walter Wang; Eric Wilcox; and Doe Hyun Yoon

arXiv:1704.04760·cs.AR·April 18, 2017·23 cites

Open Access

TL;DR

This paper evaluates a custom Tensor Processing Unit (TPU) designed for neural network inference, demonstrating significant performance and energy efficiency advantages over CPUs and GPUs in datacenter workloads.

Contribution

It provides a detailed performance and energy efficiency analysis of the TPU, highlighting its deterministic execution model and advantages over traditional hardware.

Findings

01. The TPU is on average about 15X-30X faster at inference than its contemporary CPU or GPU.

02. The TPU's TOPS/Watt is about 30X-80X higher than that of the CPU or GPU.

03. Using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt further.

Abstract

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU…
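The 92 TOPS peak quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that do not appear in the abstract excerpt above: that each MAC counts as two operations (one multiply and one add), and that the matrix unit runs at a 700 MHz clock. The function name `peak_tops` and the constant names are hypothetical.

```python
# Back-of-the-envelope check of the peak-throughput figure in the abstract.
# Only MAC_UNITS comes from the abstract; the other two values are assumptions.

MAC_UNITS = 65_536    # from the abstract (equal to 256 * 256)
OPS_PER_MAC = 2       # assumption: one multiply plus one add per MAC
CLOCK_HZ = 700e6      # assumption: 700 MHz clock, not stated in the abstract


def peak_tops(mac_units: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second (TOPS)."""
    return mac_units * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    tops = peak_tops(MAC_UNITS, OPS_PER_MAC, CLOCK_HZ)
    print(f"Peak throughput: {tops:.2f} TOPS")
```

Under these assumptions the result is about 91.75 TOPS, which rounds to the 92 TOPS figure. Peak throughput is an upper bound set by the hardware; achieved throughput on real workloads also depends on factors such as memory bandwidth.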

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Memory and Neural Computing · Advanced Data Storage Technologies