Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Xunzhi Xiang; Xingye Tian; Guiyu Zhang; Yabo Chen; Shaofeng Zhang; Xuebo Wang; Xin Tao; Qi Fan

arXiv:2511.12633·cs.CV·November 18, 2025

Open Access

TL;DR

This paper introduces a spectral self-regularization method for Denoising-VAE that reduces high-frequency noise in latent spaces, leading to faster convergence and improved image reconstruction and generation quality.

Contribution

It proposes a novel spectral self-regularization strategy for ViT-based autoencoders that enhances generative performance without relying on external foundation models.

Findings

01. Denoising-VAE produces cleaner, lower-noise latents.

02. Generative models converge approximately 2× faster.

03. Achieves state-of-the-art reconstruction quality on ImageNet.

Abstract

Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers generative performance. Recent methods attempt to address this dilemma by regularizing high-dimensional latent spaces using external vision foundation models (VFMs). However, it remains unclear how high-dimensional VAE latents affect the optimization of generative models. To our knowledge, our analysis is the first to reveal that redundant high-frequency components in high-dimensional latent spaces hinder the training convergence of diffusion models and, consequently, degrade generation quality. To alleviate this problem, we propose a spectral self-regularization strategy to suppress redundant high-frequency noise while simultaneously preserving…
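
The abstract's claim about redundant high-frequency components in latents can be probed with a simple spectral measurement. The following is a minimal sketch under stated assumptions, not the authors' regularization method: it computes the share of a latent's spectral energy above a radial cutoff and applies a hard low-pass filter for comparison. The function names (`radial_frequency_grid`, `high_freq_energy_ratio`, `low_pass`), the `(C, H, W)` latent layout, and the cutoff of 0.5 are all hypothetical choices made for illustration.

```python
import numpy as np


def radial_frequency_grid(h, w):
    """Radial frequency per FFT bin, scaled so 1.0 is the per-axis Nyquist limit."""
    fy = np.fft.fftfreq(h)[:, None]   # cycles per sample, range [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    return np.sqrt(fy ** 2 + fx ** 2) / 0.5


def high_freq_energy_ratio(latent, cutoff=0.5):
    """Fraction of spectral energy above `cutoff` for a (C, H, W) latent.

    `cutoff` is an illustrative assumption, not a value from the paper.
    """
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    power = np.abs(spectrum) ** 2
    radius = radial_frequency_grid(*latent.shape[-2:])
    total = power.sum()
    if total == 0:
        return 0.0
    return float(power[..., radius > cutoff].sum() / total)


def low_pass(latent, cutoff=0.5):
    """Zero out all frequency bins above `cutoff` (a hard, idealized filter)."""
    spectrum = np.fft.fft2(latent, axes=(-2, -1))
    radius = radial_frequency_grid(*latent.shape[-2:])
    filtered = spectrum * (radius <= cutoff)
    # The mask is symmetric in frequency, so the inverse transform is real
    # up to floating-point error.
    return np.fft.ifft2(filtered, axes=(-2, -1)).real
```

Comparing `high_freq_energy_ratio` on a latent before and after `low_pass` gives a rough sense of how much energy sits in the high-frequency band. A learned regularizer, as described in the abstract, would presumably suppress that band during training rather than by post hoc filtering.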

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Face Recognition and Perception