Latent Denoising Makes Good Tokenizers
Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang

TL;DR
This paper introduces the Latent Denoising Tokenizer (l-DeTok), a tokenizer whose embeddings are aligned with the downstream denoising objective, improving generation quality on class-conditioned (ImageNet) and text-conditioned (MSCOCO) image generation benchmarks.
Contribution
We propose l-DeTok, a tokenizer trained to reconstruct clean images from corrupted latent embeddings, and show that it improves generation quality across multiple generative models and benchmarks.
Findings
l-DeTok improves image generation quality on ImageNet and MSCOCO.
Denoising alignment enhances tokenizer effectiveness for generative tasks.
Consistent performance gains across six different generative models.
Abstract
Despite their fundamental role, it remains unclear what properties could make tokenizers more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs, such as signals degraded by Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings that remain reconstructable even under significant corruption. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet highly effective tokenizer trained to reconstruct clean images from latent embeddings corrupted via interpolative noise or random masking. Extensive experiments on class-conditioned (ImageNet 256x256 and 512x512) and text-conditioned (MSCOCO)…
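The two corruption schemes named in the abstract, interpolative noise and random masking, can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names, the noise scale `gamma`, the uniform sampling of the corruption level `tau` and of the masking ratio, the `max_ratio` cap, and the zero `mask_token` are all assumptions made for the example.

```python
import numpy as np


def interpolative_noise(latents, rng, gamma=1.0):
    """Blend latents with Gaussian noise: (1 - tau) * x + tau * gamma * eps."""
    batch = latents.shape[0]
    # Assumed: one corruption level per sample, tau ~ U(0, 1),
    # broadcast over the token and channel axes.
    tau = rng.uniform(0.0, 1.0, size=(batch, 1, 1))
    eps = rng.standard_normal(latents.shape)
    return (1.0 - tau) * latents + tau * gamma * eps, tau


def random_masking(latents, rng, mask_token, max_ratio=0.9):
    """Replace a random subset of tokens in each sample with a mask token."""
    batch, num_tokens, _ = latents.shape
    # Assumed: one masking ratio per sample, drawn from U(0, max_ratio).
    ratio = rng.uniform(0.0, max_ratio, size=(batch, 1))
    mask = rng.uniform(size=(batch, num_tokens)) < ratio  # True = masked
    corrupted = np.where(mask[..., None], mask_token, latents)
    return corrupted, mask


rng = np.random.default_rng(0)
# Stand-in for encoder output with shape (batch, tokens, channels).
latents = rng.standard_normal((4, 16, 8))
noisy, tau = interpolative_noise(latents, rng)
masked, mask = random_masking(latents, rng, mask_token=np.zeros(8))
```

In a tokenizer trained this way, the decoder would receive `noisy` or `masked` in place of the clean latents, while the reconstruction loss is still computed against the clean image.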
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is motivated by an accurate and under-discussed observation: modern generative models, regardless of architecture, are fundamentally denoising systems. Training tokenizers with explicit latent corruption (interpolative noise, masking) is a clean conceptual shift that breaks with the tradition of mere pixel-wise autoencoding. This alignment is theoretically meaningful and empirically justified. - The authors benchmark $l$-DeTok across a broad spectrum of generative models, on both cl
- Lack of Theoretical Analysis Regarding Optimality or Limitations: The empirical link between denoising-aligned tokenizers and improved downstream performance is clear, but the theoretical rationale is underdeveloped. For example, there is no formal analysis or proof of why interpolative rather than additive noise leads to strictly more robust or generative-friendly latents (as claimed in Section 5.1.1). While Figure 2 empirically demonstrates this, a mathematical discussion (e.g., in terms of mutual
- The paper is well-motivated, addressing the critical challenge of aligning the training objectives of visual tokenizers and generative models. - The proposed method is simple yet effective. The strategy of injecting interpolative or masking noise is conceptually sound and well-justified. - The methodology and implementation details are presented with clarity, making the work easy to understand and reproduce. - The experimental evaluation is extensive and well-structured, providing strong empir
- Convergence and scalability: A potential concern is the training convergence. While the denoising objective complements the pixel-reconstruction loss, it is plausible that learning to reconstruct from corrupted latents could slow down convergence compared to a vanilla baseline. It would be beneficial for the authors to provide an analysis of the training speed and computational overhead. Furthermore, a discussion on the scalability of the proposed method to larger models and datasets would str
1. Generalizes across AR and non-AR generators. - The same tokenizer improves both diffusion-based (non-AR) and autoregressive models without architectural changes, indicating that the denoising-aligned latent space is broadly compatible with diverse generation mechanisms. 2. Simple and practical method: no external encoder alignment or semantic distillation is required. - While recent approaches emphasize semantic distillation from powerful pretrained vision models, l-DeTok shows that a
Regarding related work, please add the following references. - Zhao et al., ε-VAE: Denoising as Visual Decoding. - Tschannen et al., Generative Infinite-Vocabulary Transformers. - Kim et al., Efficient Generative Modeling with Residual Vector Quantization-Based Tokens.
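One review above asks for a mathematical contrast between interpolative and additive noise. A short signal-to-noise calculation illustrates the difference. It is a sketch under assumptions not stated on this page: the two corruption forms, the symbols $\tau$ (corruption level), $\gamma$ (noise scale), and $\epsilon$ (Gaussian noise), and a latent $x$ with unit variance per coordinate and independent of $\epsilon$ are all chosen for illustration. It is not the formal analysis the reviewer requests.

```latex
\begin{align*}
\tilde{x}_{\mathrm{add}} &= x + \tau\gamma\,\epsilon,
  & \mathrm{SNR}_{\mathrm{add}}(\tau) &= \frac{1}{\tau^{2}\gamma^{2}}, \\
\tilde{x}_{\mathrm{int}} &= (1-\tau)\,x + \tau\gamma\,\epsilon,
  & \mathrm{SNR}_{\mathrm{int}}(\tau) &= \frac{(1-\tau)^{2}}{\tau^{2}\gamma^{2}},
\end{align*}
% with \epsilon \sim \mathcal{N}(0, I) and \tau \in (0, 1]
```

As $\tau \to 1$, the additive form keeps the signal at full amplitude and its SNR stays at $1/\gamma^{2}$. The interpolative form drives the SNR to zero, so under these assumptions the decoder is exposed to inputs ranging from clean latents to pure noise.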
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Image Enhancement Techniques
