TL;DR
DecQ adds detail-condensing queries to Representation Autoencoders (RAEs). The queries aggregate intermediate features of the frozen encoder, easing the trade-off between reconstruction fidelity and generative quality.
Contribution
The paper proposes a lightweight framework whose detail-condensing queries improve both reconstruction quality and generative performance in RAEs.
Findings
Improves PSNR from 19.13 dB to 22.76 dB with minimal extra computation.
Achieves 3.3× faster convergence in generative modeling.
Attains an FID of 1.41 without guidance and 1.05 with guidance.
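To put the PSNR figures in context, the sketch below uses the standard PSNR definition, 10 * log10(MAX^2 / MSE), and converts a PSNR gain into the implied reduction in mean squared error. The function names and the assumption that images are scaled to [0, 1] are illustrative and not taken from the paper.

```python
import numpy as np

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, max_value]."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gain(psnr_before, psnr_after):
    """Factor by which MSE shrinks when PSNR rises from psnr_before to psnr_after."""
    return 10.0 ** ((psnr_after - psnr_before) / 10.0)

# Reported improvement: 19.13 dB -> 22.76 dB, a gain of 3.63 dB.
ratio = mse_ratio_from_psnr_gain(19.13, 22.76)
print(f"MSE reduced by a factor of about {ratio:.2f}")
```

Under this definition, the reported 3.63 dB gain corresponds to roughly a 2.3× lower mean squared error, assuming both figures were measured at the same peak value.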
Abstract
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial reconstruction capacity, limiting fine-grained generation and image editing; in contrast, incorporating reconstruction-oriented signals via fine-tuning disrupts the pretrained semantic space and degrades generative fidelity. To address this trade-off, we propose DecQ, a simple yet effective framework for RAEs. Specifically, DecQ introduces lightweight detail-condensing queries that extract fine-grained information from intermediate VFM features through condenser modules. These queries are incorporated into the decoder to support reconstruction and are jointly generated with patch tokens during…
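The abstract does not specify how the condenser modules are built. As a rough illustration of the general idea, the sketch below lets a small set of query vectors read from intermediate patch features with single-head cross-attention. Everything here is an assumption for illustration: the function name condense, the residual update, the shapes, and the random stand-ins for learned weights and VFM features.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condense(queries, features, w_q, w_k, w_v):
    """Hypothetical condenser: queries (Q, D) attend over features (N, D)."""
    q = queries @ w_q                          # (Q, D)
    k = features @ w_k                         # (N, D)
    v = features @ w_v                         # (N, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (Q, N) scaled dot products
    attn = softmax(scores, axis=-1)            # each query's weights sum to 1
    # Residual update is an assumption, not something the abstract states.
    return queries + attn @ v, attn

rng = np.random.default_rng(0)
num_patches, num_queries, dim = 256, 8, 64     # illustrative sizes only

# Random stand-ins for intermediate VFM features and learned parameters.
features = rng.standard_normal((num_patches, dim))
queries = rng.standard_normal((num_queries, dim)) * 0.02
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))

condensed, attn = condense(queries, features, w_q, w_k, w_v)
print(condensed.shape, attn.shape)
```

In this sketch the frozen features are only read, never modified, which mirrors the stated goal of adding detail without disturbing the pretrained semantic space. How the actual method wires the queries into the decoder and the diffusion model may differ.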
