Generative AI Beyond LLMs: System Implications of Multi-Modal Generation

Alicia Golden; Samuel Hsia; Fei Sun; Bilge Acun; Basil Hosmer; Yejin Lee; Zachary DeVito; Jeff Johnson; Gu-Yeon Wei; David Brooks; Carole-Jean Wu

arXiv:2312.14385 · cs.DC · May 7, 2024 · 1 citation

Open Access

TL;DR

This paper systematically analyzes the system design and performance of multi-modal generative AI models for images and videos, highlighting unique challenges and optimization opportunities beyond traditional LLMs.

Contribution

It provides the first comprehensive performance characterization of multi-modal text-to-image and text-to-video models, revealing key bottlenecks and differences from LLMs.

Findings

1. Convolution accounts for up to 44% of execution time in Diffusion-based text-to-image (TTI) models.

2. Linear layers consume up to 49% of execution time in Transformer-based TTI models.

3. Temporal Attention accounts for over 60% of total Attention time in text-to-video (TTV) workloads.
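
The distinction between spatial and temporal Attention in the last finding can be illustrated with a minimal sketch. It assumes the factorized layout commonly used in video generation models, where spatial Attention runs over the positions within each frame and temporal Attention runs over the frames at each position; this page does not confirm that every model in the study uses this layout. The names `AttentionShape`, `spatial_attention_shape`, `temporal_attention_shape`, and `score_matrix_elements`, as well as the latent video size, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionShape:
    """Number of independent attention problems and tokens per problem."""
    batch: int
    seq_len: int


def spatial_attention_shape(batch, frames, height, width):
    # Assumption: each frame attends independently over its height * width positions.
    return AttentionShape(batch=batch * frames, seq_len=height * width)


def temporal_attention_shape(batch, frames, height, width):
    # Assumption: each spatial position attends independently across the frames.
    return AttentionShape(batch=batch * height * width, seq_len=frames)


def score_matrix_elements(shape):
    # Elements in the (seq_len x seq_len) score matrices, summed over the batch.
    return shape.batch * shape.seq_len ** 2


if __name__ == "__main__":
    # Hypothetical latent video size, not taken from the paper.
    dims = dict(batch=1, frames=16, height=32, width=32)
    for name, fn in [("spatial", spatial_attention_shape),
                     ("temporal", temporal_attention_shape)]:
        shape = fn(**dims)
        print(name, shape, score_matrix_elements(shape))
```

The sketch only counts shapes and score-matrix sizes. It does not model execution time, which also depends on memory access patterns and kernel efficiency.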

Abstract

As the development of large-scale Generative AI models evolves beyond text (1D) generation to include image (2D) and video (3D) generation, processing spatial and temporal information presents unique challenges to quality, performance, and efficiency. We present the first work towards understanding this new system design space for multi-modal text-to-image (TTI) and text-to-video (TTV) generation models. Current model architecture designs are bifurcated into two categories: Diffusion- and Transformer-based models. Our systematic performance characterization on a suite of eight representative TTI/TTV models shows that after state-of-the-art optimization techniques such as Flash Attention are applied, Convolution accounts for up to 44% of execution time for Diffusion-based TTI models, while Linear layers consume up to 49% of execution time for Transformer-based models. We additionally…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization

Methods: Convolution · Diffusion