
TL;DR
SR-DiT combines several techniques for improving diffusion transformer training efficiency, achieving a state-of-the-art result for its model size on ImageNet-256 with a 140M-parameter model at 400K training iterations.
Contribution
The paper presents SR-DiT, a framework that systematically combines multiple techniques to enhance diffusion transformer training efficiency and performance.
Findings
Achieved FID 3.49 and KDD 0.319 on ImageNet-256 with a 140M-parameter model at 400K iterations.
Reached results comparable to 685M-parameter models trained significantly longer, without using classifier-free guidance.
Identified effective technique combinations and documented their synergies and incompatibilities.
Abstract
Recent advances have significantly improved the training efficiency of diffusion transformers. However, these techniques have largely been studied in isolation, leaving unexplored the potential synergies from combining multiple approaches. We present SR-DiT (Speedrun Diffusion Transformer), a framework that systematically integrates token routing, architectural improvements, and training modifications on top of representation alignment. Our approach achieves FID 3.49 and KDD 0.319 on ImageNet-256 using only a 140M-parameter model at 400K iterations without classifier-free guidance, comparable to results from 685M-parameter models trained significantly longer. To our knowledge, this is a state-of-the-art result at this model size. Through extensive ablation studies, we identify which technique combinations are most effective and document both synergies and incompatibilities. We release…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Advanced Memory and Neural Computing
