Answer Fast: Accelerating BERT on the Tensor Streaming Processor

Ibrahim Ahmed; Sahil Parmar; Matthew Boyd; Michael Beidler; Kris Kang; Bill Liu; Kyle Roach; John Kim; Dennis Abts

arXiv:2206.11062 · cs.LG · June 23, 2022 · 1 citation

TL;DR

This paper presents a method to accelerate BERT inference on a tensor streaming processor by fusing nonlinear components with matrix multiplications, achieving a 6x speedup and deterministic low latency.

Contribution

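The paper introduces a technique that fuses BERT's nonlinear components with its matrix multiplications, so that the on-chip matrix multiplication units stay efficiently utilized during inference on the tensor streaming processor.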

Findings

01. Deterministic tail latency of 130 μs for batch-1 BERT-base inference

02. 6x faster inference compared to state-of-the-art methods

03. Efficient utilization of on-chip matrix multiplication units

Abstract

Transformers have become a predominant machine learning workload. They are not only the de facto standard for natural language processing tasks, but are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems such as machine translation and web search. These real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several nonlinear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the tensor streaming processor. By carefully fusing all the nonlinear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication…
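To illustrate the general idea of fusing a nonlinear component with a matrix multiplication, here is a minimal, hardware-agnostic sketch in plain Python. It is not the authors' implementation and says nothing about how the tensor streaming processor schedules work; the function names (`unfused_linear_gelu`, `fused_linear_gelu`) and the choice of GELU as the nonlinearity are assumptions made for illustration. The unfused version materializes the full matmul result and then sweeps over it again, while the fused version applies the bias and activation to each output element as soon as its dot product is complete.

```python
import math


def gelu(x):
    # Exact GELU, written with the Gaussian error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def unfused_linear_gelu(a, b, bias):
    # Pass 1: compute and store the whole matrix product.
    inner, cols = len(b), len(b[0])
    y = [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
         for row in a]
    # Pass 2: a separate sweep over the stored result for bias + GELU.
    return [[gelu(v + bias[j]) for j, v in enumerate(row)] for row in y]


def fused_linear_gelu(a, b, bias):
    # Single pass: the nonlinearity is applied to each output element
    # right after its dot product finishes, so no intermediate matrix
    # is stored or re-read.
    inner, cols = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(cols):
            acc = 0.0
            for k in range(inner):
                acc += row[k] * b[k][j]
            out_row.append(gelu(acc + bias[j]))
        out.append(out_row)
    return out


# Hypothetical toy inputs, chosen only to exercise the two functions.
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [0.25, 0.75]]
bias = [0.1, -0.2]

fused = fused_linear_gelu(a, b, bias)
unfused = unfused_linear_gelu(a, b, bias)
```

Both functions compute the same values; the difference is only in when the nonlinearity runs relative to the matrix multiplication, which is the property a fusion strategy exploits.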

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Tensor decomposition and applications · Quantum Computing Algorithms and Architecture

Methods: Attention Is All You Need · Linear Layer · Softmax · Dropout · Linear Warmup With Linear Decay · Multi-Head Attention · Weight Decay · Residual Connection · Layer Normalization · Adam