Efficiently Scaling Transformer Inference

Reiner Pope; Sholto Douglas; Aakanksha Chowdhery; Jacob Devlin; James Bradbury; Anselm Levskaya; Jonathan Heek; Kefan Xiao; Shivani Agrawal; Jeff Dean

arXiv:2211.05102 · cs.LG · November 10, 2022 · 57 cites

Open Access

TL;DR

This paper presents a comprehensive approach to optimizing Transformer inference for large models, achieving significant latency reductions and efficiency improvements through analytical modeling, partitioning strategies, and low-level optimizations, enabling longer context lengths and faster generation.

Contribution

The paper introduces a new analytical model and optimized partitioning techniques for Transformer inference, enabling efficient scaling, longer contexts, and improved latency on large models like PaLM 540B.

Findings

01

Achieved a low-batch-size latency of 29 ms per token during generation, using int8 weight quantization.

02

Attained 76% model FLOPS utilization (MFU) during large-batch-size processing of input tokens.

03

Supported a 2048-token context length on the PaLM 540B parameter model.
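
To make the MFU figure above concrete, here is a minimal sketch of how model FLOPS utilization is commonly computed: observed token throughput divided by the throughput the hardware would reach at peak FLOPS. The function name `mfu`, its parameters, and the approximation of 2 FLOPs per parameter per token for a forward pass (attention FLOPs ignored) are assumptions made for illustration, not code from the paper. The numbers in the usage line are hypothetical and are not measurements.

```python
def mfu(tokens_per_second: float,
        n_params: float,
        n_chips: int,
        peak_flops_per_chip: float) -> float:
    """Model FLOPS utilization: observed throughput / theoretical peak throughput.

    Assumption: a forward pass costs about 2 * n_params FLOPs per token
    (one multiply and one add per weight); attention FLOPs are ignored.
    """
    flops_per_token = 2.0 * n_params
    # Tokens per second the system could process if every chip ran at peak FLOPS.
    peak_tokens_per_second = n_chips * peak_flops_per_chip / flops_per_token
    return tokens_per_second / peak_tokens_per_second


# Hypothetical toy numbers: a 1B-parameter model on one chip with 2 TFLOPS peak.
print(mfu(tokens_per_second=760, n_params=1e9, n_chips=1, peak_flops_per_chip=2e12))
```

With these toy inputs the peak throughput is 1000 tokens per second, so 760 observed tokens per second corresponds to an MFU of 0.76.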

Abstract

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Pathways Language Model · Label Smoothing · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Linear Layer · Adam · Absolute Position Encodings