Inference Optimization of Foundation Models on AI Accelerators

Youngsuk Park; Kailash Budhathoki; Liangfu Chen; Jonas Kübler; Jiaji Huang; Matthäus Kleindessner; Jun Huan; Volkan Cevher; Yida Wang; George Karypis

arXiv:2407.09111 · cs.AI · October 2, 2024 · 1 citation

Open Access

TL;DR

This paper discusses techniques to optimize the inference of large Transformer-based foundation models on AI accelerators, focusing on system, architectural, and model compression strategies to reduce costs and latency.

Contribution

It provides a comprehensive overview of inference optimization methods tailored for AI accelerators, including system, architectural, and model compression techniques.

Findings

01. Enhanced attention computation efficiency on accelerators

02. Key architectural elements for fast transformer inference identified

03. Effective model compression and decoding strategies discussed

Abstract

Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on those foundation models. Such applications include question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep…

Taxonomy

Topics: Reservoir Engineering and Simulation Methods · Advanced Computational Techniques and Applications · Advanced Data Processing Techniques

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax