Representation Entanglement for Generation: Training Diffusion Transformers Is Much Easier Than You Think

Ge Wu; Shen Zhang; Ruijing Shi; Shanghua Gao; Zhenyuan Chen; Lei Wang; Zhaowei Chen; Hongcheng Gao; Yao Tang; Jian Yang; Ming-Ming Cheng; Xiang Li

arXiv:2507.01467 · cs.CV · September 30, 2025

TL;DR

This paper introduces Representation Entanglement for Generation (REG), a simple method that entangles image latents with high-level class tokens from pretrained models, significantly improving diffusion model training efficiency and generation quality with minimal inference overhead.

Contribution

The paper proposes REG, a novel technique that entangles low-level image latents with high-level class tokens, enabling faster training and better generation quality in diffusion models.

Findings

01

Achieves 63x and 23x faster training on ImageNet 256×256 than SiT-XL/2 and SiT-XL/2 + REPA, respectively.

02

Produces coherent image-class pairs directly from noise with minimal additional inference cost.

03

Outperforms longer-trained models with significantly fewer training iterations.
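Finding 02 says the model produces an image-class pair directly from noise. The sketch below illustrates what that could look like at sampling time: a single sequence holding one class-token row plus the image latent rows is integrated from pure noise to data, then split. It is a toy illustration, not the authors' sampler. The Euler integrator, the convention that t=1 is noise and t=0 is data, the shapes, and `toy_velocity` (a closed-form stand-in for a trained denoiser) are all assumptions introduced here.

```python
import numpy as np

def sample_pair(velocity_fn, num_patches, dim, steps, rng):
    """Euler-integrate a joint sequence from pure noise (t=1) to data (t=0).

    Row 0 of the sequence is the class token; the remaining rows are image
    latent tokens. Both are denoised together by the same velocity function.
    """
    seq = rng.standard_normal((num_patches + 1, dim))
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        seq = seq + (t_next - t_cur) * velocity_fn(seq, t_cur)
    return seq[0], seq[1:]  # (class token, image latents)

# Hypothetical stand-in for a trained network: the exact velocity that pulls
# a linearly interpolated sample back toward one fixed clean sequence.
rng = np.random.default_rng(0)
clean = rng.standard_normal((5, 4))

def toy_velocity(seq, t):
    return (seq - clean) / t

class_token, latents = sample_pair(toy_velocity, num_patches=4, dim=4, steps=10, rng=rng)
print(class_token.shape, latents.shape)  # (4,) (4, 4)
```

With this toy velocity the final Euler step lands on `clean`, so the returned class token and latents match its first row and remaining rows. A real denoiser would only approximate the velocity, and its output would depend on the noise seed and conditioning.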

Abstract

REPA and its variants effectively mitigate training challenges in diffusion models by incorporating external visual representations from pretrained models, through alignment between the noisy hidden projections of denoising networks and foundational clean image representations. We argue that the external alignment, which is absent during the entire denoising inference process, falls short of fully harnessing the potential of discriminative representations. In this work, we propose a straightforward method called Representation Entanglement for Generation (REG), which entangles low-level image latents with a single high-level class token from pretrained foundation models for denoising. REG acquires the capability to produce coherent image-class pairs directly from pure noise, substantially improving both generation quality and training efficiency. This is accomplished with negligible…
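The entanglement idea in the abstract can be sketched in a few lines: a single class token from a pretrained encoder is noised together with the image latent tokens and prepended to the sequence that the denoiser sees. This is a minimal sketch under stated assumptions, not the paper's implementation. The linear interpolation noise schedule, the velocity-style target, the projection matrix `proj`, the loss weight `beta`, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np

def add_noise(clean, noise, t):
    """Linear interpolation: t=0 gives the clean signal, t=1 gives pure noise."""
    return (1.0 - t) * clean + t * noise

def entangle(latent_tokens, class_token, proj, t, rng):
    """Noise the latent patches and one class token, then join them."""
    cls = class_token @ proj  # project the encoder feature to model width, shape (D,)
    eps_x = rng.standard_normal(latent_tokens.shape)
    eps_c = rng.standard_normal(cls.shape)
    x_t = add_noise(latent_tokens, eps_x, t)  # (N, D)
    c_t = add_noise(cls, eps_c, t)            # (D,)
    sequence = np.concatenate([c_t[None, :], x_t], axis=0)  # (N + 1, D)
    # Velocity target under the interpolation above: d/dt = noise - clean.
    target = np.concatenate([(eps_c - cls)[None, :], eps_x - latent_tokens], axis=0)
    return sequence, target

def joint_loss(pred, target, beta=1.0):
    """MSE on image latent rows plus a weighted MSE on the class-token row."""
    cls_loss = np.mean((pred[0] - target[0]) ** 2)
    img_loss = np.mean((pred[1:] - target[1:]) ** 2)
    return img_loss + beta * cls_loss

rng = np.random.default_rng(0)
N, D, D_ENC = 16, 8, 12                      # hypothetical sizes
latents = rng.standard_normal((N, D))        # stand-in for VAE latent patches
class_token = rng.standard_normal(D_ENC)     # stand-in for an encoder class token
proj = rng.standard_normal((D_ENC, D)) / np.sqrt(D_ENC)

sequence, target = entangle(latents, class_token, proj, t=0.3, rng=rng)
pred = np.zeros_like(target)                 # placeholder for a denoiser's output
print(sequence.shape)                        # (17, 8): N patches plus one class token
print(joint_loss(pred, target))
```

The only change to the sequence length is the single extra row, which is why the approach adds little to inference cost. Because the class token is part of the denoised sequence rather than an external alignment target, it remains available throughout sampling.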

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Domain Adaptation and Few-Shot Learning