RAE-AR: Taming Autoregressive Models with Representation Autoencoders
Hu Yu, Hang Xu, Jie Huang, Zeyue Xue, Haoyang Huang, Nan Duan, Feng Zhao

TL;DR
This paper explores integrating high-dimensional representation autoencoders into autoregressive models, addressing key challenges with novel techniques to improve generative performance and unify visual understanding architectures.
Contribution
It introduces distribution normalization and noise injection to incorporate representation autoencoders into autoregressive models, closing the performance gap with VAE-based latents.
Findings
Representation autoencoders achieve comparable results to VAEs in AR models.
Token simplification via distribution normalization improves training convergence.
Gaussian noise injection enhances robustness and reduces exposure bias.
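As a rough illustration of the two ideas named in the findings above, the following NumPy sketch standardizes each latent channel and then perturbs the tokens with Gaussian noise. It is not the authors' implementation: the function names normalize_tokens and inject_noise, the per-channel statistics, the tensor shapes, and the value sigma=0.1 are all assumptions chosen for the example, and the paper's actual normalization and noise schedule may differ.

```python
import numpy as np


def normalize_tokens(tokens, eps=1e-6):
    """Standardize each latent channel over the batch and token axes.

    Assumption: per-channel statistics; the paper may normalize differently.
    tokens has shape (batch, num_tokens, channels).
    """
    mean = tokens.mean(axis=(0, 1), keepdims=True)
    std = tokens.std(axis=(0, 1), keepdims=True)
    return (tokens - mean) / (std + eps), mean, std


def inject_noise(tokens, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return tokens + sigma * rng.standard_normal(tokens.shape)


rng = np.random.default_rng(0)

# Hypothetical stand-in for encoder latents: 8 images, 16 tokens, 768 channels.
latents = rng.normal(loc=3.0, scale=5.0, size=(8, 16, 768))

# Step 1: bring every channel to roughly zero mean and unit variance.
normed, mean, std = normalize_tokens(latents)

# Step 2: perturb the tokens the AR model conditions on during training,
# so it sees inputs closer to its own imperfect predictions at inference.
noisy_context = inject_noise(normed, sigma=0.1, rng=rng)
```

In a setup like this, the stored mean and std would be used to map generated tokens back to the original latent scale before decoding.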
Abstract
The latent space of generative modeling has long been dominated by the VAE encoder. Latents from pretrained representation encoders (e.g., DINO, SigLIP, MAE) were previously considered unsuitable for generative modeling. Recently, the RAE method showed that a representation autoencoder can achieve performance competitive with the VAE encoder. However, the integration of representation autoencoders into continuous autoregressive (AR) models remains largely unexplored. In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, denoted as RAE-AR. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling and a training-inference gap (exposure bias) that is amplified by high dimensionality. To address these, we introduce token…
