TL;DR
This paper introduces RAEv2, a revision of Representation Autoencoders (RAE) built on a systematic study of RAE design choices. The changes lead to faster training, better image generation quality, and broader applicability across tasks.
Contribution
The paper systematically investigates design choices in RAE, introduces RAEv2 with key improvements, and demonstrates significant speed and quality enhancements in generative modeling.
Findings
RAEv2 converges over 10x faster than the original RAE.
RAEv2 attains a state-of-the-art gFID of 1.06 in 80 epochs on ImageNet-256.
RAEv2 outperforms previous methods on FDr^k with a score of 2.17 at 80 epochs.
Abstract
Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights that simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using a pretrained representation as the encoder) replaces representation alignment (REPA), which instead distills the same representation into intermediate layers. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for…
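The first insight, defining the representation as the sum of the last k encoder layers, can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `last_k_sum`, the toy random "layers", and the absence of any normalization or weighting are assumptions made only for illustration.

```python
import numpy as np

def last_k_sum(hidden_states, k):
    """Sum the outputs of the last k encoder layers (hypothetical helper).

    hidden_states: list of arrays, one per layer, each (num_tokens, dim).
    k=1 recovers the usual final-layer representation.
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    return np.sum(np.stack(hidden_states[-k:], axis=0), axis=0)

# Toy stand-in for a frozen pretrained encoder: 4 layers, 3 tokens, 2 dims.
# Real per-layer features would come from the vision encoder itself.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((3, 2)) for _ in range(4)]

z_final = last_k_sum(layers, k=1)  # final layer only
z_multi = last_k_sum(layers, k=3)  # sum of the last three layers
```

Setting k to 1 reduces to the final-layer representation, so this formulation is a strict generalization of it.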
