TL;DR
This paper identifies a collapse mechanism in very deep diffusion transformers caused by mean-dominated homogenization, and proposes a residual modification to enable stable training at 1000 layers.
Contribution
It identifies Mean Mode Screaming (MMS) as the trigger of the collapse and proposes Mean-Variance Split (MV-Split) Residuals to prevent it, enabling stable training of 1000-layer DiTs.
Findings
MV-Split prevents divergence in a 400-layer DiT.
The 1000-layer DiT remains trainable without collapse.
MV-Split outperforms token-isotropic gating methods.
Abstract
Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained…
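The null-space claim in the abstract can be checked on a toy example. The sketch below is not the paper's code: it assumes a single attention row with output o = sum_j p_j v_j and p = softmax(z), and the names `softmax`, `logit_grad`, `values`, and `upstream` are hypothetical. For this setup the gradient with respect to logit k is p_k (g_k - sum_j p_j g_j), where g_j is the inner product of the upstream gradient with v_j. When every value vector is identical, all g_j are equal, the bracket vanishes, and no gradient reaches the logits.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of logits.
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def logit_grad(z, values, upstream):
    # Gradient of the loss w.r.t. logits z for o = sum_j p_j * v_j,
    # where upstream = dL/do. Uses the softmax Jacobian diag(p) - p p^T.
    p = softmax(z)
    # g_j = <upstream, v_j>: how much the loss changes per unit weight on token j.
    g = [sum(u * x for u, x in zip(upstream, v)) for v in values]
    g_bar = sum(pj * gj for pj, gj in zip(p, g))
    return [pj * (gj - g_bar) for pj, gj in zip(p, g)]

z = [0.3, -1.2, 2.0]        # hypothetical attention logits for one query
upstream = [0.5, -0.25]     # hypothetical dL/do

# Homogenized values: every token carries the same value vector.
homogenized = [[1.0, -2.0], [1.0, -2.0], [1.0, -2.0]]
# Varied values: tokens still differ from one another.
varied = [[1.0, -2.0], [0.0, 1.0], [2.0, 0.5]]

grad_homog = logit_grad(z, homogenized, upstream)
grad_varied = logit_grad(z, varied, upstream)

print("homogenized:", grad_homog)   # numerically zero
print("varied:     ", grad_varied)  # clearly nonzero
```

In both cases the logit gradient sums to zero, since the all-ones direction lies in the null space of the softmax Jacobian; homogenized values push the entire upstream signal into that direction.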
