Mean Mode Screaming: Mean-Variance Split Residuals for 1000-Layer Diffusion Transformers

Pengqi Lu

arXiv:2605.06169·cs.LG·May 11, 2026

TL;DR

This paper identifies a collapse mechanism in very deep diffusion transformers caused by mean-dominated homogenization, and proposes a residual modification to enable stable training at 1000 layers.

Contribution

It introduces Mean Mode Screaming (MMS) as a collapse trigger and proposes Mean-Variance Split (MV-Split) Residuals to prevent it, enabling stable training of 1000-layer DiTs.

Findings

01 MV-Split prevents divergence in a 400-layer DiT.

02 The 1000-layer DiT remains trainable without collapse.

03 MV-Split outperforms token-isotropic gating methods.

Abstract

Scaling Diffusion Transformers (DiTs) to hundreds of layers introduces a structural vulnerability: networks can enter a silent, mean-dominated collapse state that homogenizes token representations and suppresses centered variation. Through mechanistic auditing, we isolate the trigger event of this collapse as Mean Mode Screaming (MMS). MMS can occur even when training appears stable, with a mean-coherent backward shock on residual writers that opens deep residual branches and drives the network into a mean-dominated state. We show this behavior is driven by an exact decomposition of these gradients into mean-coherent and centered components, compounded by the structural suppression of attention-logit gradients through the null space of the Softmax Jacobian once values homogenize. To address this, we propose Mean-Variance Split (MV-Split) Residuals, which combine a separately gained…
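Two ingredients named in the abstract can be illustrated with a small sketch: the split of token representations into a token-mean part and a centered part, and the fact that the Softmax Jacobian annihilates constant vectors, so attention-logit gradients vanish once all value vectors are identical. This is not the paper's code. The function names (`split_mean_centered`, `logit_grad`), the single-query, single-head setup, and the toy numbers are assumptions made for illustration only.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def split_mean_centered(X):
    """Split token rows X (n tokens x d features) into the token mean
    and the centered residue, so that X[i] = mean + centered[i]."""
    n, d = len(X), len(X[0])
    mean = [sum(row[k] for row in X) / n for k in range(d)]
    centered = [[row[k] - mean[k] for k in range(d)] for row in X]
    return mean, centered

def logit_grad(z, V, g):
    """Gradient of a loss w.r.t. the attention logits z of one query.

    Hypothetical setup: o = sum_j p_j * V[j] with p = softmax(z), and
    g = dL/do. The result is (diag(p) - p p^T) u with u_j = V[j] . g.
    """
    p = softmax(z)
    u = [sum(vk * gk for vk, gk in zip(v, g)) for v in V]
    pu = sum(pj * uj for pj, uj in zip(p, u))
    return [pj * (uj - pu) for pj, uj in zip(p, u)]

z = [0.3, -1.2, 2.0]          # toy attention logits
g = [0.5, -1.0]               # toy upstream gradient dL/do

V_diverse = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
V_homog = [[1.0, 2.0]] * 3    # every token carries the same value vector

mean_d, centered_d = split_mean_centered(V_diverse)
mean_h, centered_h = split_mean_centered(V_homog)

grad_diverse = logit_grad(z, V_diverse, g)
grad_homog = logit_grad(z, V_homog, g)

print("centered part (homogeneous values):", centered_h)
print("logit gradient (diverse values):   ", grad_diverse)
print("logit gradient (homogeneous values):", grad_homog)
```

With homogeneous values the centered part is zero and `u` is a constant vector, which lies in the null space of `diag(p) - p p^T`, so the logit gradient is zero up to rounding. With diverse values the logit gradient is non-zero but still sums to zero, since the Jacobian is symmetric and its rows and columns sum to zero.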

Code & Models

Repositories

erwold/mv-split (GitHub)

Models

StableKirito/mvsplit-dit-1000l (Hugging Face model)