Characterizing and Efficiently Accelerating Multimodal Generation Model Inference

Yejin Lee; Anna Sun; Basil Hosmer; Bilge Acun; Can Balioglu; Changhan Wang; Charles David Hernandez; Christian Puhrsch; Daniel Haziza; Driss Guessous; Francisco Massa; Jacob Kahn; Jeffrey Wan; Jeremy Reizenstein; Jiaqi Zhai; Joe Isaacson; Joel Schlosser; Juan Pino; Kaushik Ram Sadagopan; Leonid Shamis; Linjian Ma; Min-Jae Hwang; Mingda Chen; Mostafa Elhoushi; Pedro Rodriguez; Ram Pasunuru; Scott Yih; Sravya Popuri; Xing Liu; and Carole-Jean Wu

arXiv:2410.00215 · cs.LG · May 13, 2025

TL;DR

This paper analyzes the system design challenges of multimodal generative AI models, identifying bottlenecks like auto-regressive token generation and attention, and demonstrates optimization strategies that significantly improve inference efficiency.

Contribution

It characterizes multimodal generation models on real systems and proposes optimization techniques across applications, system software, and hardware to accelerate inference.

Findings

01 Auto-regressive token generation is a key latency bottleneck.

02 Memory-intensive attention and feed forward networks contribute to inference delays.

03 Optimization strategies can improve baseline performance by up to 3.88x.
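
To make the first and third findings concrete, here is a minimal toy sketch in Python. It is not the paper's code or any of its models: `toy_step`, `greedy_decode`, and `speedup` are hypothetical names introduced only for illustration. The sketch shows why auto-regressive generation is inherently sequential (each new token needs its own model call that depends on all previous tokens) and how a speedup factor such as "3.88x" is conventionally computed as baseline latency divided by optimized latency.

```python
from typing import Callable, List, Tuple


def greedy_decode(
    step_fn: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    eos_id: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time; return the sequence and the number of model calls."""
    tokens = list(prompt)
    calls = 0
    for _ in range(max_new_tokens):
        # Each iteration depends on every token produced so far,
        # so the calls cannot be issued in parallel.
        next_token = step_fn(tokens)
        calls += 1
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens, calls


def toy_step(tokens: List[int]) -> int:
    """Hypothetical stand-in for a model forward pass: next token is (last + 1) mod 10."""
    return (tokens[-1] + 1) % 10


def speedup(baseline_latency: float, optimized_latency: float) -> float:
    """Speedup factor: how many times faster the optimized run is than the baseline."""
    return baseline_latency / optimized_latency


if __name__ == "__main__":
    sequence, num_calls = greedy_decode(toy_step, prompt=[3], max_new_tokens=20, eos_id=7)
    print(sequence, num_calls)
    print(speedup(3.88, 1.0))
```

In this toy setting the number of sequential model calls grows with the number of generated tokens, which is the structural reason per-token overheads add up during decoding.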

Abstract

Generative artificial intelligence (AI) technology is revolutionizing the computing industry. Not only have its applications broadened to various sectors, but it also poses new system design and optimization opportunities. The technology is capable of understanding and responding in multiple modalities. However, this advanced capability currently comes with significant system resource demands. To sustainably scale generative AI capabilities to billions of users in the world, inference must be fast and efficient. This paper pinpoints key system design and optimization opportunities by characterizing a family of emerging multi-modal generation models on real systems. Auto-regressive token generation is a critical latency performance bottleneck, typically dominated by GPU idle time. In addition to memory-intensive attention across the generative AI models, linear operations constitute…

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training