Dynamic Diffusion Transformer

Wangbo Zhao; Yizeng Han; Jiasheng Tang; Kai Wang; Yibing Song; Gao Huang; Fan Wang; Yang You

arXiv:2410.03456·cs.CV·October 10, 2024

TL;DR

The paper introduces Dynamic Diffusion Transformer (DyDiT), which reduces computational costs in image generation by dynamically adjusting computation across timesteps and spatial regions, leading to faster generation and lower FLOPs.

Contribution

It proposes an architecture that adapts computation in diffusion models along both the timestep and spatial dimensions, cutting the FLOPs of DiT-XL by 51% while reaching an FID of 2.07 on ImageNet.

Findings

01. Reduces FLOPs of DiT-XL by 51%
02. Speeds up generation by 1.73 times
03. Achieves FID score of 2.07 on ImageNet
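As a rough sanity check on how the FLOPs and speed-up figures relate: if latency scaled exactly with FLOPs (an idealizing assumption made here, not a claim from the paper), a 51% FLOPs reduction would cap the speed-up at about 2.04 times. The reported 1.73 times is then roughly 85% of that ideal bound, which is plausible given that dynamic masking and routing add overhead that FLOP counts do not capture.

```latex
\text{ideal speed-up} = \frac{1}{1 - 0.51} \approx 2.04,
\qquad
\frac{\text{reported speed-up}}{\text{ideal speed-up}} = \frac{1.73}{2.04} \approx 0.85
```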

Abstract

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify…
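The two mechanisms named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the router weights, sizes, thresholds, and the choice to let skipped tokens bypass the MLP through a residual path are all assumptions made for illustration. It only shows the general idea that a TDW-style mask depends on the timestep alone (so it can be computed once ahead of sampling), while an SDT-style mask depends on the tokens themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical and chosen only to keep the demo small.
NUM_HEADS = 4      # attention heads in one block
DIM = 8            # token embedding width
NUM_TOKENS = 16    # image patches per sample
NUM_STEPS = 10     # sampling timesteps

# Hypothetical router parameters; in a trained model these would be learned.
W_head = rng.normal(size=(DIM, NUM_HEADS))   # timestep embedding -> head scores
w_token = rng.normal(size=DIM)               # token -> keep score


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def timestep_embedding(t, dim=DIM):
    """Sinusoidal embedding of an integer timestep."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])


def head_mask(t, threshold=0.5):
    """TDW-style mask: which heads are active depends only on the timestep."""
    scores = sigmoid(timestep_embedding(t) @ W_head)
    mask = scores > threshold
    if not mask.any():                  # always keep at least one head active
        mask[np.argmax(scores)] = True
    return mask


def token_mask(tokens, threshold=0.5):
    """SDT-style mask: which tokens are processed depends on the tokens."""
    return sigmoid(tokens @ w_token) > threshold


def dynamic_mlp(tokens, mlp):
    """Apply `mlp` only to selected tokens; the others pass through unchanged."""
    keep = token_mask(tokens)
    out = tokens.copy()
    if keep.any():
        out[keep] = tokens[keep] + mlp(tokens[keep])
    return out, keep


# Head masks depend only on t, so they can be tabulated once after training.
precomputed_head_masks = np.stack([head_mask(t) for t in range(NUM_STEPS)])

tokens = rng.normal(size=(NUM_TOKENS, DIM))
out, keep = dynamic_mlp(tokens, mlp=lambda x: np.tanh(x))
```

In this sketch the table `precomputed_head_masks` holds one boolean row per timestep, so the head router never has to run during sampling, whereas `token_mask` must be evaluated on every forward pass because it depends on the current tokens.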

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The motivation is sound and clearly presented, supported by a well-designed teaser figure.
- The proposed TDW and SDT mechanisms enable dynamic adjustment of model modules, and the FLOPs-constrained loss effectively controls the desired FLOPs of the final model.
- Extensive experiments and thorough ablation studies validate the module's effectiveness.
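The exact form of the FLOPs-constrained loss is not given in this excerpt. One common way to build such a constraint, sketched below purely as an assumption, is to penalize the squared gap between the expected fraction of active compute and a target ratio. The function names, the 50/50 attention/MLP cost split, and the `weight` parameter are all hypothetical.

```python
import numpy as np


def flops_constraint_loss(head_probs, token_probs, target_ratio):
    """Squared gap between the expected active-compute ratio and a target.

    head_probs:  (timesteps, heads) router probabilities of keeping each head
    token_probs: (timesteps, tokens) router probabilities of keeping each token
    """
    # Toy cost model (an assumption): attention and MLP each account for half
    # of a block's FLOPs, and cost scales linearly with the kept fraction.
    attn_ratio = head_probs.mean()
    mlp_ratio = token_probs.mean()
    expected_ratio = 0.5 * attn_ratio + 0.5 * mlp_ratio
    return (expected_ratio - target_ratio) ** 2


def total_loss(diffusion_loss, head_probs, token_probs, target_ratio, weight=1.0):
    """Add the weighted FLOPs penalty to the usual diffusion training loss."""
    penalty = flops_constraint_loss(head_probs, token_probs, target_ratio)
    return diffusion_loss + weight * penalty
```

With a penalty of this shape, the routers are pushed toward the target budget on average rather than at every timestep, which leaves them free to spend more compute where it matters and less elsewhere.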

Weaknesses

- It is unclear how the "pre-define" in L214 benefits the sampling stage. I understand that the activation of attention heads and groups is based solely on timesteps, allowing the masks to be precomputed once training is completed. However, it seems impractical or inefficient to store all possible pre-defined structures, so it primarily saves computational costs on the attention routers. This cost doesn't seem substantial, though. Am I correct?
- The adaptation of the proposed modules to efficien…

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- This paper is easy to follow.
- The authors conduct sufficient ablation studies to evaluate the proposed modules.
- The authors conduct experiments on a wide range of datasets, including ImageNet, Food, Artbench, Cars, and Birds, and compare against many state-of-the-art diffusion backbones. The results show the effectiveness of the proposed method.
- The authors also perform experiments on text-to-image generation, demonstrating the plug-and-play nature of SDT and TDW.

Weaknesses

- The authors demonstrate the results of their method on PixArt-$\alpha$, which is commendable. However, the acceleration achieved in this text-to-image model is inferior to that in class-to-image generation. A more in-depth analysis of this discrepancy would be valuable. Moreover, providing image samples generated by the accelerated text-to-image model could be helpful for the analysis.
- Providing the image samples generated by the diffusion models accelerated with **different** $\mathbf{\lam…

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The proposed approach reduces GFLOPs by 51% and achieves a 1.73x speed-up during training.
- Detailed ablation studies are presented to demonstrate the contribution of each component to overall performance.

Weaknesses

- The model appears to be somewhat incremental in its contributions. It trains multiple routers to selectively mask certain MHSA heads and MLP blocks.
- I recommend including some state-of-the-art models in Table 1, such as DiffiT [1], SiT [2], and DiMR [3], as these also introduce architectural innovations to the DiT model.
- Also, it would be beneficial to move the 512 result from the supplementary material into the main table (and also add other methods), as training speed is a more critical factor in larger-scale…

