TL;DR
WeTok is a discrete visual tokenizer that combines group-wise lookup-free quantization with a generative decoder. It reports high-fidelity image reconstruction at compression ratios as high as 768×, with better ImageNet reconstruction than previous tokenizers.
Contribution
The paper presents WeTok, a visual tokenizer built on two components, Group-wise lookup-free Quantization (GQ) and a Generative Decoder (GD), which together improve compression and reconstruction fidelity.
Findings
Achieves a record-low zero-shot rFID of 0.12 on ImageNet in the high-fidelity setting.
Outperforms existing tokenizers like FLUX-VAE and SD-VAE in reconstruction quality.
Maintains high reconstruction fidelity at a 768× compression ratio, surpassing prior methods.
Abstract
Visual tokenizers are a critical component of vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization for each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers, while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoder (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. In this case, GD can probabilistically model the distribution of visual data conditioned on discrete tokens,…
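As a rough illustration of the group-wise lookup-free quantization described in the abstract, the following Python sketch splits one latent vector into groups and binarizes each channel by its sign. It is a minimal sketch, not the authors' implementation: the function names `group_lfq` and `codebook_sizes`, the sign-based binarization rule, and the bit ordering are assumptions made for this example, and the training-time entropy loss is omitted.

```python
from typing import List, Tuple


def group_lfq(latent: List[float], num_groups: int) -> Tuple[List[int], List[int]]:
    """Quantize one latent vector group by group (hypothetical helper).

    Returns (codes, indices): codes are per-channel signs in {-1, +1},
    indices hold one integer token per group.
    """
    channels = len(latent)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    group_size = channels // num_groups

    # Lookup-free step (assumed rule): binarize each channel by its sign,
    # so no table of embedding vectors is stored or searched.
    codes = [1 if value > 0 else -1 for value in latent]

    indices = []
    for g in range(num_groups):
        group = codes[g * group_size:(g + 1) * group_size]
        # Read the group's signs as a binary number (first channel = lowest bit).
        index = sum((1 << bit) for bit, sign in enumerate(group) if sign > 0)
        indices.append(index)
    return codes, indices


def codebook_sizes(channels: int, num_groups: int) -> Tuple[int, int]:
    """Return (entries per group, entries of the implied joint codebook)."""
    group_size = channels // num_groups
    return 2 ** group_size, 2 ** channels


if __name__ == "__main__":
    latent = [0.3, -1.2, 0.8, 0.1, -0.4, -0.9, 2.0, -0.1]
    codes, indices = group_lfq(latent, num_groups=2)
    print(codes)    # [1, -1, 1, 1, -1, -1, 1, -1]
    print(indices)  # [13, 4]
    print(codebook_sizes(channels=8, num_groups=2))  # (16, 256)
```

With 8 channels and 2 groups, each group indexes only 2^4 = 16 codes, while the two indices together cover 2^8 = 256 combinations. Under the assumptions above, this is the sense in which grouping keeps per-group statistics small while the overall codebook grows.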
Peer Reviews
Decision·ICLR 2026 Poster
1. The proposed GQ formulation provides a mathematically grounded way to reduce the entropy-loss memory bottleneck in LFQ and BSQ, with a provably smaller approximation error.
2. The paper includes large-scale ablations (quantization types, group numbers, architectures, learning schedules) and comparisons across both high-fidelity and high-compression regimes.
3. The proposed method achieves strong performance on both image reconstruction and AR-based generation results, even surpassing continuo
1. Diffusion-based decoders for visual reconstruction have been studied in prior work [1][2]; it would be better to cite these works and further discuss how WeTok differs from them.
2. In the ablation study section, it is interesting to see that the reconstructed images become more realistic after the decoder is converted to a generative model. It would be better to include further discussion or analysis of this.
[1] Epsilon-VAE: Denoising as Visual Decoding
[2] Diffusion Autoencoders are Scalable
The paper is well written. The reconstruction results achieve SOTA performance among existing discrete tokenizers, demonstrating the effectiveness of the proposed framework.
## Fairness of Comparison
The comparison in Table 3 appears unfair. The strong baseline MGVQ is a VQ-based tokenizer, whereas WeTok adopts LSQ, which has already been shown to be more efficient than VQ. To ensure a fair evaluation, the authors should compare WeTok with an LSQ-based version of MGVQ. Furthermore, the MGVQ codebook size is only 8192 × 4, but its effective capacity is actually $2^{52}$, not limited by the nominal codebook size.
## Lack of Novelty
The proposed method shows limit
(+) The results presented in the tables are relatively strong, achieving a much larger codebook size and good rFID and PSNR.
(+) The two proposed methods (GQ and GD) make sense and are well motivated. GQ seems to be a practical and effective solution to the bottleneck of the CE loss.
(+) The evaluation compares against many methods in Tables 3 and 4.
(-) The GD method is not very novel. The community already knows that such a diffusion decoder can work, dating back to OpenAI's "Consistency Decoder". In addition, such a generative decoder does not come without cost. First, the decoding time increases, which can limit some real-time or latency-sensitive applications. Second, as it is a generative model, the decoder could also hallucinate.
(-) Lack of comparison with more state-of-the-art autoencoders. For example, infinity tokenizer (https://c
