Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Shengbang Tong; Boyang Zheng; Ziteng Wang; Bingda Tang; Nanye Ma; Ellis Brown; Jihan Yang; Rob Fergus; Yann LeCun; Saining Xie

arXiv:2601.16208·cs.CV·January 23, 2026

TL;DR

This paper demonstrates that Representation Autoencoders (RAEs) scale effectively to large text-to-image diffusion models, outperforming VAEs in fidelity, stability, and convergence speed, and enabling unified multimodal reasoning.

Contribution

The work shows that scaling RAEs simplifies the diffusion framework and improves performance over VAEs for large-scale text-to-image generation, with better stability and convergence.

Findings

01 RAEs outperform VAEs across all model scales during pretraining.

02 RAEs remain stable during extended finetuning, unlike VAEs, which overfit.

03 RAEs achieve faster convergence and higher quality in image generation.

Abstract

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale, freeform text-to-image (T2I) generation. We first scale RAE decoders on the frozen representation encoder (SigLIP-2) beyond ImageNet by training on web, synthetic, and text-rendering data, finding that while scale improves general fidelity, targeted data composition is essential for specific domains like text. We then rigorously stress-test the RAE design choices originally proposed for ImageNet. Our analysis reveals that scaling simplifies the framework: while dimension-dependent noise scheduling remains critical, architectural complexities such as wide diffusion heads and noise-augmented decoding offer negligible benefits at scale. Building on this…
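The abstract names dimension-dependent noise scheduling but does not give its form. As a rough illustration only, the sketch below uses the timestep shift common in rectified-flow models, where the shift factor grows with the square root of the latent dimensionality. The function name `shift_timestep`, the reference size `base_dim=4096`, and the convention that t = 1 is pure noise are all assumptions made for this example, not details taken from the paper.

```python
import math


def shift_timestep(t: float, latent_dim: int, base_dim: int = 4096) -> float:
    """Shift a timestep t in [0, 1] toward the noisy end as latent_dim grows.

    Assumptions (not from the paper): t = 1 is pure noise, base_dim is a
    hypothetical reference dimensionality, and the shift factor is
    alpha = sqrt(latent_dim / base_dim).
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if latent_dim <= 0 or base_dim <= 0:
        raise ValueError("dimensions must be positive")
    alpha = math.sqrt(latent_dim / base_dim)
    # alpha == 1 leaves t unchanged; alpha > 1 pushes t toward 1 (more noise).
    return alpha * t / (1.0 + (alpha - 1.0) * t)


if __name__ == "__main__":
    # Compare the shifted midpoint for a few hypothetical latent sizes.
    for dim in (4096, 16384, 65536):
        print(dim, round(shift_timestep(0.5, dim), 4))
```

The endpoints t = 0 and t = 1 map to themselves, so only the allocation of intermediate noise levels changes. The intuition is that a higher-dimensional latent carries more redundant signal, so a comparable degree of corruption calls for a larger share of steps at high noise.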

Code & Models

Datasets

nyu-visionx/scale-rae-data
dataset · 42k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neuroimaging Techniques and Applications · Domain Adaptation and Few-Shot Learning