No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Dengyang Jiang; Mengmeng Wang; Liuzhuozheng Li; Lei Zhang; Haoyu Wang; Wei Wei; Guang Dai; Yanning Zhang; Jingdong Wang

arXiv:2505.02831·cs.CV·January 27, 2026

TL;DR

This paper introduces Self-Representation Alignment (SRA), a method that lets diffusion transformers derive representation guidance from their own internal representations, removing the need for external representation components and improving generative training efficiency.

Contribution

The study demonstrates that diffusion transformers can internally provide representation guidance without external components, using SRA to align internal features during training.

Findings

01

SRA improves diffusion transformer performance across experiments.

02

Diffusion transformers with SRA outperform methods that introduce an auxiliary external representation task.

03

The approach achieves results comparable to methods that rely on a large-scale, pre-trained external representation encoder.

Abstract

Recent studies have demonstrated that learning a meaningful internal representation can accelerate generative training. However, existing approaches either introduce an off-the-shelf external representation task or rely on a large-scale, pre-trained external representation encoder to provide representation guidance during training. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We propose Self-Representation Alignment (SRA), a simple yet effective method that obtains representation guidance from the internal representations of the learned diffusion transformer. SRA aligns the latent representation of the diffusion transformer in the earlier layer conditioned on higher noise to that in the later layer conditioned on lower…
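The alignment idea in the abstract can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the toy layer stack, the linear noising rule, the cosine-distance objective, and every function and parameter name below are assumptions made for illustration, and the sketch assumes the truncated sentence ends with "lower noise". It only shows which two features are compared: an earlier-layer feature from a noisier input and a later-layer feature from a less noisy input.

```python
import math
import random


def add_noise(x, t, rng):
    """Assumed noising rule: blend clean x (t=0) with Gaussian noise (t=1)."""
    return [(1.0 - t) * xi + t * rng.gauss(0.0, 1.0) for xi in x]


def toy_layers(x, t, weights):
    """Hypothetical stand-in for transformer blocks.

    Applies one elementwise layer per weight, conditioned on the noise
    level t, and returns the hidden state after every layer.
    """
    h, states = x, []
    for w in weights:
        h = [math.tanh(w * hi + t) for hi in h]
        states.append(h)
    return states


def cosine_distance(a, b):
    """1 - cosine similarity; an assumed choice of alignment objective."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return 1.0 - dot / (na * nb + 1e-8)


def sra_alignment_loss(x, t_high, t_low, weights, early, late, rng):
    """Compare an early-layer feature at higher noise with a late-layer
    feature at lower noise, both taken from the same toy model."""
    assert t_high > t_low and early < late
    source = toy_layers(add_noise(x, t_high, rng), t_high, weights)[early]
    target = toy_layers(add_noise(x, t_low, rng), t_low, weights)[late]
    return cosine_distance(source, target)


rng = random.Random(0)
loss = sra_alignment_loss(
    x=[0.5, -1.0, 2.0, 0.1],
    t_high=0.8,
    t_low=0.2,
    weights=[0.9, 1.1, 0.7, 1.3],
    early=0,
    late=3,
    rng=rng,
)
print(f"alignment loss: {loss:.4f}")
```

In an autograd framework, the later-layer, lower-noise feature would presumably be treated as a fixed target so that only the earlier-layer feature is pulled toward it; that detail is an assumption here, and the linked repository is the place to check how the authors actually construct the target.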

Code & Models

Repositories

vvvvvjdy/sra
PyTorch · Official

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Quantum many-body systems

Methods: Diffusion