Representation Alignment for Just Image Transformers is not Easier than You Think

Jaeyo Shin, Jiwook Kim, and Hyunjung Shim

arXiv:2603.14366 · cs.CV · March 17, 2026

Open Access

TL;DR

This paper demonstrates that representation alignment methods like REPA can fail for Just Image Transformers (JiT) due to information asymmetry, and introduces PixelREPA, a modified approach that improves training stability and image quality.

Contribution

The paper identifies the limitations of REPA for JiT and proposes PixelREPA, a novel alignment method that enhances training convergence and image generation quality.

Findings

01

PixelREPA reduces FID from 3.66 to 3.17 on ImageNet.

02

PixelREPA improves Inception Score from 275.1 to 284.6.

03

PixelREPA achieves over 2x faster convergence.
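To put the two reported metric changes on a common scale, the short sketch below computes their relative change in percent. The helper name `relative_change` is hypothetical and used only for illustration; the four input values are the ones listed in the findings above.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Values taken from the findings above.
fid_change = relative_change(3.66, 3.17)    # FID: lower is better, so negative is an improvement
is_change = relative_change(275.1, 284.6)   # Inception Score: higher is better

print(f"FID change: {fid_change:.1f}%")
print(f"IS change: {is_change:+.1f}%")
```

Under this reading, the FID drop is roughly 13% and the Inception Score gain roughly 3.5%.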

Abstract

Representation Alignment (REPA) has emerged as a simple way to accelerate the training of Diffusion Transformers in latent space. At the same time, pixel-space diffusion transformers such as Just Image Transformers (JiT) have attracted growing attention because they remove the dependency on a pretrained tokenizer and thus avoid the reconstruction bottleneck of latent diffusion. This paper shows that REPA can fail for JiT. REPA yields worse FID for JiT as training proceeds and collapses diversity on image subsets that are tightly clustered in the representation space of a pretrained semantic encoder on ImageNet. We trace the failure to an information asymmetry: denoising occurs in the high-dimensional image space, while the semantic target is strongly compressed, making direct regression a shortcut objective. We propose PixelREPA, which transforms the alignment target and constrains alignment…
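The direct-regression objective that the abstract describes as a shortcut can be illustrated with a minimal sketch of a REPA-style alignment term. This is not the paper's PixelREPA method, whose details are not given in the text above. It assumes the commonly used form of the loss, a negative patch-wise cosine similarity between projected denoiser features and frozen encoder features, and it simplifies the projection head to a single linear map. All names (`repa_alignment_loss`, `proj_weight`, the shapes) are hypothetical.

```python
import numpy as np


def repa_alignment_loss(hidden: np.ndarray,
                        target: np.ndarray,
                        proj_weight: np.ndarray) -> float:
    """Negative mean patch-wise cosine similarity (assumed REPA-style objective).

    hidden:      (num_patches, d_model) intermediate denoiser features
    target:      (num_patches, d_enc) features from a frozen semantic encoder
    proj_weight: (d_model, d_enc) hypothetical linear projection head
    """
    eps = 1e-8  # guards against division by zero for all-zero rows
    projected = hidden @ proj_weight
    projected = projected / (np.linalg.norm(projected, axis=-1, keepdims=True) + eps)
    unit_target = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cosine = np.sum(projected * unit_target, axis=-1)  # one similarity per patch
    return float(-cosine.mean())


# Toy shapes: 16 patches, model width 32, encoder width 8 (illustrative only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 32))
proj_weight = rng.normal(size=(32, 8))
target = rng.normal(size=(16, 8))

loss_random = repa_alignment_loss(hidden, target, proj_weight)
# If the projected features already match the target, the loss reaches its minimum of about -1.
loss_aligned = repa_alignment_loss(hidden, hidden @ proj_weight, proj_weight)
```

The sketch makes the asymmetry easy to see: the target has far fewer dimensions than the pixel-space signal being denoised, so a term like this can be driven toward its minimum without the model retaining the detail that denoising needs.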

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning