FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Sotiris Anagnostidis; Gregor Bachmann; Yeongmin Kim; Jonas Kohler; Markos Georgopoulos; Artsiom Sanakoyeu; Yuming Du; Albert Pumarola; Ali Thabet; Edgar Schönfeld

arXiv:2502.20126·cs.LG·February 28, 2025

Open Access

TL;DR

FlexiDiT introduces a dynamic compute strategy for diffusion transformers, enabling high-quality image and video generation with significantly reduced computational costs while maintaining performance.

Contribution

This work presents FlexiDiT, a framework that converts pre-trained diffusion transformers into flexible models able to run at varying compute budgets during inference, reducing the required FLOPs by more than 40% for image generation without a drop in quality.

Findings

01. Reduces FLOPs by over 40% for image generation.

02. Enables flexible compute during inference without quality loss.

03. Extends to video generation with up to 75% less compute.

Abstract

Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones, dubbed FlexiDiT, allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning…
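
To make the idea of a dynamic per-step budget concrete, the arithmetic below sketches how total compute changes when some denoising steps run at a cheaper setting. It is an illustration only: the function name `relative_flops`, the step counts, and the cost ratio are hypothetical and not taken from the paper, and the abstract does not say how the cheaper steps are implemented or scheduled.

```python
def relative_flops(num_steps: int, cheap_steps: int, cheap_cost_ratio: float) -> float:
    """Return total FLOPs as a fraction of a static baseline.

    The baseline spends one unit of compute on each of `num_steps` steps.
    Here `cheap_steps` of them instead cost `cheap_cost_ratio` units each
    (all values are illustrative assumptions, not figures from the paper).
    """
    if not 0 <= cheap_steps <= num_steps:
        raise ValueError("cheap_steps must lie between 0 and num_steps")
    if not 0.0 <= cheap_cost_ratio <= 1.0:
        raise ValueError("cheap_cost_ratio must lie between 0 and 1")
    full_steps = num_steps - cheap_steps
    return (full_steps + cheap_steps * cheap_cost_ratio) / num_steps


# Hypothetical schedule: 50 steps, 30 of them at a quarter of the full cost.
fraction = relative_flops(num_steps=50, cheap_steps=30, cheap_cost_ratio=0.25)
print(f"{1 - fraction:.0%}")  # prints 45%
```

Under these made-up numbers the schedule uses 55% of the baseline compute. A static model corresponds to `cheap_steps=0`, which returns 1.0.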

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Computer Graphics and Visualization Techniques

Methods: Diffusion