Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis

TL;DR
This paper discusses techniques to optimize the inference of large Transformer-based foundation models on AI accelerators, focusing on system, architectural, and model compression strategies to reduce costs and latency.
Contribution
It provides a comprehensive overview of inference optimization methods tailored for AI accelerators, including system, architectural, and model compression techniques.
Findings
Enhanced attention computation efficiency on accelerators
Key architectural elements for fast transformer inference identified
Effective model compression and decoding strategies discussed
Abstract
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on those foundation models. Such applications include question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep…
Taxonomy
Topics: Reservoir Engineering and Simulation Methods · Advanced Computational Techniques and Applications · Advanced Data Processing Techniques
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax
