Exploring Magnitude Preservation and Rotation Modulation in Diffusion Transformers
Eric Tillman Bill, Cristian Perez Jensen, Sotiris Anagnostidis, Dimitri von Rütte

TL;DR
This paper introduces a magnitude-preserving design and rotation modulation for diffusion transformers. Together they improve training stability and performance while reducing parameter count, and the results may inform the choice of conditioning strategy.
Contribution
The paper proposes a magnitude-preserving design that stabilizes diffusion transformer training without normalization layers, together with rotation modulation, a conditioning method that uses learned rotations instead of scaling or shifting.
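As a rough illustration of what "magnitude-preserving" can mean for a single layer, the sketch below normalizes each weight row to unit length before applying it, so the output scale does not depend on how large the raw weights have grown. This is one common way to obtain the property and is an assumption here, not a description taken from the paper; the function name `mp_linear` and the example values are hypothetical.

```python
import math

def mp_linear(x, weight, eps=1e-8):
    """Apply a linear map whose rows are rescaled to unit L2 norm.

    Assumption for illustration: with unit-norm rows, uncorrelated
    unit-variance inputs give roughly unit-variance outputs.
    """
    out = []
    for row in weight:
        row_norm = math.sqrt(sum(w * w for w in row)) + eps
        out.append(sum(w * xi for w, xi in zip(row, x)) / row_norm)
    return out

# Hypothetical input and weights (2 output features, 3 input features).
x = [0.5, -1.0, 2.0]
weight = [[1.0, 2.0, 2.0], [0.0, 3.0, 4.0]]

y = mp_linear(x, weight)
# Scaling the raw weights by 10 leaves the output essentially unchanged.
y_scaled = mp_linear(x, [[10.0 * w for w in row] for row in weight])
```

Because the raw weight scale is divided out, the layer's output magnitude is controlled by the input alone, which is the behaviour a normalization-free design relies on.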
Findings
Magnitude-preserving strategies reduce FID scores by approximately 12.8%.
Rotation modulation combined with scaling is competitive with AdaLN.
The proposed approach needs fewer parameters than existing methods.
Abstract
Denoising diffusion models exhibit remarkable generative capabilities, but remain challenging to train due to their inherent stochasticity, where high-variance gradient estimates lead to slow convergence. Previous works have shown that magnitude preservation helps with stabilizing training in the U-net architecture. This work explores whether this effect extends to the Diffusion Transformer (DiT) architecture. As such, we propose a magnitude-preserving design that stabilizes training without normalization layers. Motivated by the goal of maintaining activation magnitudes, we additionally introduce rotation modulation, which is a novel conditioning method using learned rotations instead of traditional scaling or shifting. Through empirical evaluations and ablation studies on small-scale models, we show that magnitude-preserving strategies significantly improve performance, notably…
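The idea of conditioning through rotations rather than scaling or shifting can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes features are grouped into consecutive pairs, that each pair is rotated by its own angle, and that the angles come from a linear map of the conditioning vector. The names `rotation_modulate` and `angles_from_condition` and all numeric values are hypothetical.

```python
import math

def angles_from_condition(cond, weight):
    """Hypothetical linear map from a conditioning vector to one angle per feature pair."""
    return [sum(w * c for w, c in zip(row, cond)) for row in weight]

def rotation_modulate(x, angles):
    """Rotate each consecutive feature pair (x[2i], x[2i+1]) by angles[i]."""
    if len(x) != 2 * len(angles):
        raise ValueError("need exactly one angle per feature pair")
    out = []
    for i, theta in enumerate(angles):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(theta), math.sin(theta)
        out.extend([a * c - b * s, a * s + b * c])
    return out

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(t * t for t in v))

# Hypothetical activation (4 features = 2 pairs) and conditioning vector.
x = [1.0, 2.0, -0.5, 0.25]
cond = [0.3, -1.2]
weight = [[0.5, 0.1], [-0.7, 0.4]]  # one row per feature pair

y = rotation_modulate(x, angles_from_condition(cond, weight))
```

Since each 2D rotation is orthogonal, `l2_norm(y)` equals `l2_norm(x)` up to floating-point error, whatever the conditioning vector is. That is the sense in which such a modulation keeps activation magnitudes fixed, in contrast to a learned scale or shift.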
Taxonomy
Topics: Advanced Memory and Neural Computing · Neural Networks and Reservoir Computing · Magneto-Optical Properties and Applications
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Concatenated Skip Connection · Dense Connections · Max Pooling · Convolution · Softmax
