DINO-Tok: Adapting DINO for Visual Tokenizers
Mingkai Jia, Mingxiao Li, Zhijian Shu, Anlin Zheng, Liaoyuan Fan, Jiaxin Guo, Tianxing Shi, Dongyue Lu, Zeming Li, Xiaoyang Guo, Xiaojuan Qi, Xiao-Xiao Long, Qian Zhang, Ping Tan, Wei Yin

TL;DR
DINO-Tok is a visual tokenizer built on a frozen DINO encoder. It unifies hierarchical features to support high-fidelity, semantically consistent image generation, and targets the difficulty of balancing semantic richness against reconstruction fidelity in high-dimensional latent spaces.
Contribution
The paper presents DINO-Tok, a visual tokenizer that supports both continuous autoencoding and vector quantization on top of a frozen DINO encoder, and proposes Dominant-Subspace Quantization to improve codebook stability.
Findings
Achieves 0.28 rFID in autoencoding on ImageNet 256x256
Attains 1.10 rFID with VQ for high-fidelity reconstruction
Demonstrates strong few-step generation performance with 1.82 gFID
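For readers unfamiliar with the metrics above: rFID and gFID are conventionally the Fréchet Inception Distance computed on reconstructed and generated images respectively, each compared against real images. The standard definition is given below as background; the symbols are the usual ones for this metric and are not taken from the paper. Lower is better.

```latex
% Standard FID between real (r) and evaluated (e) image sets,
% using the mean and covariance of Inception features for each set.
\mathrm{FID} = \lVert \mu_r - \mu_e \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_e - 2\,(\Sigma_r \Sigma_e)^{1/2} \right)
```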
Abstract
Recent advances in visual generation have emphasized the importance of Latent Generative Models (LGMs), which critically depend on effective visual tokenizers to bridge pixels and semantic representations. However, tokenizers constructed on pre-trained vision foundation models (VFMs) often struggle to balance semantic richness and reconstruction fidelity in high-dimensional latent spaces. In this paper, we introduce DINO-Tok, a visual tokenizer built upon a frozen DINO encoder that supports both continuous autoencoding (DINO-Tok-AE) and discrete vector-quantization (DINO-Tok-VQ). By unifying hierarchical representations from both shallow fine-grained features and deep global semantics into an information-complete latent space, DINO-Tok preserves texture details while maintaining semantic consistency for generation. We further investigate VQ in frozen semantic feature spaces of…
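The two ideas the abstract names, fusing shallow and deep encoder features and then vector-quantizing the result, can be illustrated with a toy sketch. This is not the authors' implementation. The fusion rule (plain concatenation), the function names, and the toy vectors are all assumptions made for illustration, and the quantizer is ordinary nearest-neighbor VQ, not the paper's Dominant-Subspace Quantization.

```python
from typing import List, Sequence

Vector = Sequence[float]


def fuse_features(shallow: Vector, deep: Vector) -> List[float]:
    """Fuse one patch's shallow and deep features.

    Assumption: simple concatenation. The abstract does not say how
    DINO-Tok combines hierarchical features.
    """
    return list(shallow) + list(deep)


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latent: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry nearest to `latent`.

    This is generic nearest-neighbor vector quantization.
    """
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(latent, codebook[i]),
    )


# Toy stand-ins for features from a frozen encoder (hypothetical values).
shallow = [0.9, 0.1]  # fine-grained, early-layer feature
deep = [0.0, 1.0]     # global, late-layer feature
latent = fuse_features(shallow, deep)

# A tiny hypothetical codebook with three entries of matching dimension.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]

index = quantize(latent, codebook)
quantized = codebook[index]
print(index, quantized)
```

In the continuous (AE) setting the fused latent would be passed to a decoder directly. In the discrete (VQ) setting only the selected index would be stored, and the decoder would see the corresponding codebook entry.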
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
