RAE-AR: Taming Autoregressive Models with Representation Autoencoders

Hu Yu; Hang Xu; Jie Huang; Zeyue Xue; Haoyang Huang; Nan Duan; Feng Zhao

arXiv:2604.01545·cs.AI·April 3, 2026

TL;DR

This paper integrates high-dimensional representation autoencoders into autoregressive models, addressing complex token-wise distribution modeling and exposure bias with distribution normalization and Gaussian noise injection to improve generative performance and unify visual understanding architectures.

Contribution

The paper introduces distribution normalization and noise injection to incorporate representation autoencoders into autoregressive models effectively, narrowing the performance gap with VAE-based latents.

Findings

01. Representation autoencoders achieve comparable results to VAEs in AR models.

02. Token simplification via distribution normalization improves training convergence.

03. Gaussian noise injection enhances robustness and reduces exposure bias.
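
The two techniques named in the findings can be illustrated generically. The sketch below is not the paper's formulation, which this page does not give. It assumes that the normalization is a per-channel standardization across tokens and that the noise is i.i.d. Gaussian noise added to context tokens during training. The function names, the `sigma` value, and the toy latents are all hypothetical.

```python
import random
import statistics


def normalize_tokens(tokens, eps=1e-6):
    """Standardize each channel to roughly zero mean and unit variance.

    tokens: list of equal-length lists of floats (one inner list per token).
    Returns (normalized, means, stds) so the transform can be inverted later.
    """
    dims = len(tokens[0])
    means = [statistics.fmean(t[d] for t in tokens) for d in range(dims)]
    stds = [statistics.pstdev([t[d] for t in tokens]) + eps for d in range(dims)]
    normalized = [[(t[d] - means[d]) / stds[d] for d in range(dims)] for t in tokens]
    return normalized, means, stds


def inject_noise(tokens, sigma, rng):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to every token entry."""
    return [[x + rng.gauss(0.0, sigma) for x in t] for t in tokens]


rng = random.Random(0)
# Hypothetical toy latents: 8 tokens, 3 channels with very different scales.
latents = [
    [rng.gauss(5.0, 10.0), rng.gauss(-2.0, 0.1), rng.gauss(0.0, 1.0)]
    for _ in range(8)
]
normalized, means, stds = normalize_tokens(latents)
# Assumed training-time use: perturb the context so the model sees inputs
# that resemble its own imperfect predictions at inference time.
noisy_context = inject_noise(normalized, sigma=0.1, rng=rng)
```

Under these assumptions, standardization puts channels with very different scales on a common footing, and the noise perturbs the conditioning tokens so that training inputs are closer to the imperfect tokens the model generates at inference time.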

Abstract

The latent space of generative modeling has long been dominated by the VAE encoder. Latents from pretrained representation encoders (e.g., DINO, SigLIP, MAE) were previously considered unsuitable for generative modeling. Recently, the RAE method showed that a representation autoencoder can achieve performance competitive with the VAE encoder. However, the integration of representation autoencoders into continuous autoregressive (AR) models remains largely unexplored. In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, denoted RAE-AR. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling and a training-inference gap (exposure bias) that is amplified by high dimensionality. To address these, we introduce token…
