Full Stack Optimization of Transformer Inference: a Survey
Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami

TL;DR
This survey reviews various methods for optimizing Transformer inference, including architecture analysis, hardware implications, and neural architecture search, demonstrating significant speedups through full-stack co-design.
Contribution
It provides a comprehensive overview of Transformer inference optimization techniques and presents a case study showing up to 88.7x speedup with minimal performance loss.
Findings
Full-stack co-design can achieve up to 88.7x speedup.
Different optimization approaches complement each other.
Hardware-aware optimizations improve efficiency.
Abstract
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures…
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Label Smoothing · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding
