Rethinking Cross-Layer Information Routing in Diffusion Transformers

Chao Xu; Maohua Li; Qirui Li; Yixuan Xu; Yanke Zhou; Yunhe Li; Cuifeng Shen; Hanlin Tang; Kan Liu; Tao Lan; Lin Qu; Shao-Qun Zhang

arXiv:2605.20708 · cs.CV · May 21, 2026

TL;DR

This paper analyzes cross-layer information flow in Diffusion Transformers, identifies issues with traditional residual connections, and proposes Diffusion-Adaptive Routing (DAR) to improve training efficiency and model quality.

Contribution

The paper introduces DAR, a learnable, timestep-adaptive residual mechanism that is compatible with modern Transformer enhancements and improves both the training efficiency and the output quality of diffusion models.

Findings

01 DAR improves FID by 2.11 on ImageNet 256×256

02 DAR reduces the number of training iterations required by a factor of 8.75

03 Stacked with REPA, DAR accelerates early training stages

Abstract

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design (tokenization, attention, conditioning, objectives, and latent autoencoders) has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (DAR), a drop-in residual replacement that performs learnable, timestep-adaptive, and…
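To make the first symptom concrete, the sketch below compares plain residual addition, x ← x + f(x), with a gated mix whose gate depends on the denoising timestep. It is a toy illustration only, not the DAR formulation: the abstract above is truncated before DAR is defined, so `toy_block`, `timestep_gate`, and the convex-mix update are all hypothetical stand-ins chosen for illustration. The toy block is deliberately sign-preserving, which guarantees that the stream norm grows at every layer under plain addition.

```python
import math
import random


def toy_block(x, scales):
    """Toy stand-in for a Transformer block (assumption, not a real DiT block).

    Each coordinate is multiplied by a positive scale and squashed with tanh,
    so every output coordinate has the same sign as its input coordinate.
    """
    return [math.tanh(s * v) for s, v in zip(scales, x)]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def residual_norms(x, layer_scales):
    """Stream norm after each layer under plain residual addition x <- x + f(x)."""
    norms = [l2_norm(x)]
    for scales in layer_scales:
        fx = toy_block(x, scales)
        x = [a + b for a, b in zip(x, fx)]
        norms.append(l2_norm(x))
    return norms


def timestep_gate(t, bias=0.0, slope=4.0):
    """Hypothetical gate in (0, 1) as a function of timestep t in [0, 1].

    In a trained model, bias and slope would be learnable; here they are fixed.
    """
    return 1.0 / (1.0 + math.exp(-(bias + slope * (t - 0.5))))


def gated_norms(x, layer_scales, gate):
    """Stream norm under a hypothetical convex mix x <- (1 - g) * x + g * f(x)."""
    norms = [l2_norm(x)]
    for scales in layer_scales:
        fx = toy_block(x, scales)
        x = [(1.0 - gate) * a + gate * b for a, b in zip(x, fx)]
        norms.append(l2_norm(x))
    return norms


rng = random.Random(0)
dim, depth = 16, 12
x0 = [rng.gauss(0.0, 1.0) for _ in range(dim)]
layer_scales = [[rng.uniform(0.5, 1.5) for _ in range(dim)] for _ in range(depth)]

gate = timestep_gate(t=0.8)
plain = residual_norms(x0, layer_scales)
gated = gated_norms(x0, layer_scales, gate)

for layer, (p, g) in enumerate(zip(plain, gated)):
    print(f"layer {layer:2d}  plain={p:7.3f}  gated={g:7.3f}")
```

Under plain addition the norm rises at every layer in this toy setup, because the block output never points against the stream. The convex mix keeps the norm below the larger of the input norm and the block's maximum output norm (`sqrt(dim)` for tanh). Whether DAR uses a gate of this form is not stated in the available text.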
