TL;DR
This paper uncovers a layer-collapse phenomenon in diffusion language models: a few early layers develop highly similar activation patterns dominated by a single large super-outlier. Although it looks redundant, this outlier is crucial for output quality, and it influences compression strategies.
Contribution
It reveals that layer collapse in DLMs is driven by overtraining, not undertraining, and demonstrates the implications for model compression and deployment strategies.
Findings
Layer collapse is marked by a single dominant super-outlier in a few early layers; pruning it degrades outputs into repetitive random token loops.
DLMs are more robust to quantization and pruning than autoregressive models.
Optimal sparsity allocation varies significantly between DLMs and AR models.
Abstract
Diffusion language models (DLMs) have recently emerged as competitive alternatives to autoregressive (AR) language models, yet differences in their activation dynamics remain poorly understood. We characterize these dynamics in LLaDA-8B and identify a striking layer-collapse property: a few early layers exhibit highly similar, collapsed activation patterns dominated by a single large super-outlier persisting over a long token range. Despite its apparent redundancy, this outlier is critical: pruning it causes outputs to degrade into repetitive random token loops. Paradoxically, layers in LLaDA contain more redundant representations overall, with redundancy most pronounced in earlier layers -- the reverse of AR models, where deeper layers grow redundant due to undertraining. Our analysis indicates that layer collapse in DLMs is not driven by undertraining but by overtraining: a dominant…
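As a rough illustration of what "collapsed activation patterns dominated by a single large super-outlier" could look like numerically, the sketch below flags adjacent layers whose activation vectors are nearly parallel and whose largest magnitude dwarfs the median. This is not the paper's metric or code: the function names, the thresholds, and the toy activations are all hypothetical, chosen only to make the idea concrete.

```python
import math
from statistics import median


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length activation vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def outlier_ratio(vec):
    """Largest absolute activation divided by the median absolute activation."""
    mags = [abs(x) for x in vec]
    med = median(mags)
    return max(mags) / med if med > 0 else float("inf")


def find_collapsed_pairs(layer_acts, sim_threshold=0.99, ratio_threshold=100.0):
    """Return (i, i+1) index pairs of adjacent layers that look 'collapsed'.

    Hypothetical criterion (an assumption, not taken from the paper): the two
    layers are almost parallel AND both are dominated by one huge activation.
    """
    collapsed = []
    for i in range(len(layer_acts) - 1):
        a, b = layer_acts[i], layer_acts[i + 1]
        similar = cosine_similarity(a, b) >= sim_threshold
        dominated = min(outlier_ratio(a), outlier_ratio(b)) >= ratio_threshold
        if similar and dominated:
            collapsed.append((i, i + 1))
    return collapsed


# Toy per-layer activations (made up): layers 0 and 1 share one huge channel.
toy_acts = [
    [0.5, -0.3, 900.0, 0.2],
    [0.4, -0.2, 910.0, 0.1],
    [1.0, -2.0, 1.5, 0.7],
]

if __name__ == "__main__":
    print(find_collapsed_pairs(toy_acts))
```

In a real analysis the vectors would come from hidden states captured at each layer of the model, and the thresholds would need to be tuned empirically; the values here are placeholders.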
