Representation Alignment for Just Image Transformers is not Easier than You Think
Jaeyo Shin, Jiwook Kim, and Hyunjung Shim

TL;DR
This paper demonstrates that representation alignment methods like REPA can fail for Just Image Transformers (JiT) due to information asymmetry, and introduces PixelREPA, a modified approach that improves training stability and image quality.
Contribution
The paper identifies the limitations of REPA for JiT and proposes PixelREPA, a novel alignment method that enhances training convergence and image generation quality.
Findings
PixelREPA reduces FID from 3.66 to 3.17 on ImageNet.
PixelREPA improves Inception Score from 275.1 to 284.6.
PixelREPA achieves over 2x faster convergence.
Abstract
Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment…
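To make the "direct regression" objective mentioned in the abstract concrete, here is a minimal sketch of a REPA-style alignment term. It assumes the loss is a negative cosine similarity between linearly projected denoiser tokens and frozen-encoder tokens; the abstract does not spell out the loss, so that form is an assumption. The function name, shapes, and the single linear projection are illustrative, and this is not PixelREPA itself.

```python
import numpy as np

def repa_style_alignment_loss(hidden, target, proj):
    """Negative mean cosine similarity between projected denoiser tokens
    and semantic-encoder tokens (an assumed form of the alignment term).

    hidden: (N, d_model) hypothetical intermediate token features of the denoiser
    target: (N, d_enc)   hypothetical patch features from a frozen semantic encoder
    proj:   (d_model, d_enc) hypothetical linear projection head
    """
    pred = hidden @ proj
    # Normalize each token so the dot product below is a cosine similarity.
    pred = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + 1e-8)
    tgt = target / (np.linalg.norm(target, axis=-1, keepdims=True) + 1e-8)
    return -float(np.mean(np.sum(pred * tgt, axis=-1)))

rng = np.random.default_rng(0)
N, d_model, d_enc = 256, 64, 32          # illustrative sizes only
hidden = rng.standard_normal((N, d_model))
proj = rng.standard_normal((d_model, d_enc))

# Targets that the projection reproduces exactly give the minimum loss (about -1).
loss_aligned = repa_style_alignment_loss(hidden, hidden @ proj, proj)
# Unrelated targets give a loss near 0.
loss_random = repa_style_alignment_loss(hidden, rng.standard_normal((N, d_enc)), proj)
print(loss_aligned, loss_random)
```

In this toy setup the target has far fewer dimensions than a pixel-space input would, which is the kind of asymmetry the abstract points to: matching a compressed target can be satisfied without modeling the full high-dimensional image.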
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
