Full Stack Optimization of Transformer Inference: a Survey

Sehoon Kim; Coleman Hooper; Thanakul Wattanawong; Minwoo Kang; Ruohan Yan; Hasan Genc; Grace Dinh; Qijing Huang; Kurt Keutzer; Michael W. Mahoney; Yakun Sophia Shao; Amir Gholami

arXiv:2302.14017·cs.CL·August 23, 2023·27 cites

Open Access

TL;DR

This survey reviews various methods for optimizing Transformer inference, including architecture analysis, hardware implications, and neural architecture search, demonstrating significant speedups through full-stack co-design.

Contribution

It provides a comprehensive overview of Transformer inference optimization techniques and presents a case study showing up to 88.7x speedup with minimal performance loss.

Findings

01 Full-stack co-design can achieve up to 88.7x speedup.

02 Different optimization approaches complement each other.

03 Hardware-aware optimizations improve efficiency.

Abstract

Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures…

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Label Smoothing · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding