TL;DR
TC-AE introduces a ViT-based deep compression autoencoder that improves reconstruction and generation by addressing token-to-latent compression challenges and enhancing semantic token structure.
Contribution
The paper presents a novel ViT-based autoencoder architecture that decomposes token-to-latent compression and uses joint self-supervised training to prevent latent collapse.
Findings
Achieves better reconstruction quality under high compression ratios.
Enhances generative performance with semantic token structure.
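Identifies aggressive token-to-latent compression as the key factor limiting token number scaling, and mitigates it by splitting that compression into two stages.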
Abstract
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information…
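The token number scaling described in the abstract can be illustrated with simple arithmetic. The sketch below is not taken from the paper: the 256-pixel image size, the 64-position latent budget, the patch sizes, and the helper names `token_count` and `tokens_per_latent` are all assumptions chosen for illustration. It only shows how shrinking the ViT patch size increases the number of tokens while the latent budget stays fixed, so more tokens have to be compressed into each latent position.

```python
def token_count(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def tokens_per_latent(image_size: int, patch_size: int, latent_positions: int) -> float:
    """Tokens that must be compressed into each latent position (hypothetical metric)."""
    return token_count(image_size, patch_size) / latent_positions


# Illustrative values only; none of these numbers come from the paper.
IMAGE_SIZE = 256        # assumed input resolution
LATENT_POSITIONS = 64   # assumed fixed latent budget (e.g. an 8x8 latent grid)

for patch_size in (32, 16, 8):
    n = token_count(IMAGE_SIZE, patch_size)
    r = tokens_per_latent(IMAGE_SIZE, patch_size, LATENT_POSITIONS)
    print(f"patch={patch_size:2d}  tokens={n:4d}  tokens per latent position={r:.0f}")
```

Under these assumptions, each halving of the patch size quadruples the token count, so the token-to-latent step has to compress four times as much. That growing ratio is the kind of aggressive compression the abstract names as the limiting factor.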
