A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators

Dan Zhang; Safeen Huda; Ebrahim Songhori; Kartik Prabhu; Quoc Le; Anna Goldie; Azalia Mirhoseini

arXiv:2105.12842·cs.LG·February 2, 2022

TL;DR

This paper introduces FAST (Full-stack Accelerator Search Technique), a search framework that explores hardware and software design decisions together to produce inference accelerators optimized for specific datacenter workloads.

Contribution

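FAST defines a single optimization environment spanning the hardware-software stack: hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. The authors use it to analyze and address bottlenecks in vision and NLP models, including EfficientNet and BERT.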

Findings

01

FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7x on average across all benchmarks compared to TPU-v3.

02

FAST accelerators improve Perf/TDP by 2.4x on average for multi-workload serving.

03

FAST-generated accelerators are practical for moderate-sized datacenter deployments.
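
As a rough illustration of how a Perf/TDP improvement like those in the findings above can be computed, the Python sketch below divides throughput by thermal design power for each design and normalizes the result against a baseline accelerator. The `Design` class, the workload names, and all numbers are hypothetical, and the geometric mean is an assumption: the page says only "on average" without naming the averaging method.

```python
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass(frozen=True)
class Design:
    """Hypothetical accelerator design point (illustration only)."""
    throughput: float  # e.g. inferences per second; the unit is an assumption
    tdp_watts: float   # thermal design power in watts

    def perf_per_tdp(self) -> float:
        return self.throughput / self.tdp_watts


def relative_perf_per_tdp(candidate: Design, baseline: Design) -> float:
    """Perf/TDP of `candidate` expressed as a multiple of `baseline`."""
    return candidate.perf_per_tdp() / baseline.perf_per_tdp()


# Made-up numbers, not taken from the paper.
baseline = Design(throughput=1000.0, tdp_watts=400.0)
candidates = {
    "workload_a": Design(throughput=2000.0, tdp_watts=200.0),
    "workload_b": Design(throughput=1500.0, tdp_watts=300.0),
}

gains = {
    name: relative_perf_per_tdp(design, baseline)
    for name, design in candidates.items()
}

# Assumed averaging method: geometric mean of the per-workload ratios.
mean_gain = geometric_mean(list(gains.values()))

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}x")
print(f"average: {mean_gain:.2f}x")
```

A geometric mean is a common choice when averaging ratios because it is not dominated by a single large speedup, but an arithmetic mean would fit the same code by swapping in `statistics.mean`.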

Abstract

The rapidly-changing deep learning landscape presents a unique opportunity for building inference accelerators optimized for specific datacenter-scale workloads. We propose Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding. In this paper, we analyze bottlenecks in state-of-the-art vision and natural language processing (NLP) models, including EfficientNet and BERT, and use FAST to design accelerators capable of addressing these bottlenecks. FAST-generated accelerators optimized for single workloads improve Perf/TDP by 3.7x on average across all benchmarks compared to TPU-v3. A FAST-generated accelerator optimized for serving a…

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · WordPiece · Weight Decay · Linear Warmup With Linear Decay · Layer Normalization