AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training
Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun, Wen Huang, Wanting Xu, Jing Long, Shuai Di, Junwu Xiong

TL;DR
AdaptiveLoad introduces a dual-constraint load balancing framework and a specialized CUDA kernel to optimize training efficiency of large-scale video diffusion Transformers, addressing computational load imbalance and memory bottlenecks.
Contribution
It presents a novel adaptive load balancing system and a fused CUDA kernel to improve GPU utilization and training throughput for video diffusion models.
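The abstract names the kernel as a fused LayerNorm-Modulate kernel. As a rough illustration only, the sketch below assumes the operation is the usual DiT-style pattern: LayerNorm without learned affine parameters, followed by a per-channel scale and shift. The function name `layernorm_modulate` and the `eps` default are hypothetical, and this is plain Python for a single token, not the paper's CUDA implementation.

```python
import math

def layernorm_modulate(x, shift, scale, eps=1e-6):
    """Reference for one token: LayerNorm (no affine), then modulate.

    x, shift, scale are equal-length lists of floats (hidden dimension D).
    Assumed formula: y = LN(x) * (1 + scale) + shift.
    """
    d = len(x)
    mean = sum(x) / d
    var = sum((v - mean) ** 2 for v in x) / d  # biased variance, as in LayerNorm
    inv_std = 1.0 / math.sqrt(var + eps)
    # Normalize and modulate in one loop, so the normalized
    # intermediate is never stored as a separate tensor.
    return [(v - mean) * inv_std * (1.0 + s) + b
            for v, s, b in zip(x, scale, shift)]

x = [1.0, 2.0, 3.0, 4.0]
shifted = layernorm_modulate(x, shift=[0.5] * 4, scale=[0.0] * 4)
scaled = layernorm_modulate(x, shift=[0.0] * 4, scale=[1.0] * 4)
```

Computing both steps in a single pass is the idea behind fusion: an unfused version would write the normalized activations to memory and read them back for the modulate step.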
Findings
Reduced computational imbalance rate from 39% to 18.9%
Improved peak VRAM utilization efficiency by 22.7%
Achieved 27.2% increase in training throughput
Abstract
In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes AdaptiveLoad, an integrated optimization framework consisting of two core components: (1) a dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load; (2) a fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile…
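To make the dual-constraint idea concrete, here is a minimal sketch of a greedy batcher that caps both a memory proxy and a compute proxy. It is an illustration under stated assumptions, not the paper's algorithm: the function `pack_batches`, the parameters `max_tokens` and `max_cost`, and the choice of the sum of squared sequence lengths as the attention-cost proxy (each sequence attends only to itself) are all hypothetical.

```python
def pack_batches(seq_lens, max_tokens, max_cost):
    """Greedily group sequences so each batch respects two budgets.

    max_tokens: cap on the sum of lengths (assumed proxy for memory).
    max_cost:   cap on the sum of squared lengths (assumed proxy for
                quadratic self-attention compute).
    A single sequence that exceeds a cap still gets its own batch.
    """
    batches = []
    current, tokens, cost = [], 0, 0
    for n in sorted(seq_lens, reverse=True):  # longest sequences first
        over = tokens + n > max_tokens or cost + n * n > max_cost
        if current and over:
            batches.append(current)
            current, tokens, cost = [], 0, 0
        current.append(n)
        tokens += n
        cost += n * n
    if current:
        batches.append(current)
    return batches

lens = [1024] * 4 + [256] * 16
batches = pack_batches(lens, max_tokens=4096, max_cost=2 * 1024 * 1024)
costs = [sum(n * n for n in b) for b in batches]
```

In this toy run the long sequences are split into pairs by the compute cap, while the short sequences fill a batch up to the token cap. An equal-token rule alone would put all four 1024-token sequences in one batch, at four times the attention cost of the sixteen 256-token sequences despite the same token count.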
