DINO-Tok: Adapting DINO for Visual Tokenizers

Mingkai Jia; Mingxiao Li; Zhijian Shu; Anlin Zheng; Liaoyuan Fan; Jiaxin Guo; Tianxing Shi; Dongyue Lu; Zeming Li; Xiaoyang Guo; Xiaojuan Qi; Xiao-Xiao Long; Qian Zhang; Ping Tan; Wei Yin

arXiv:2511.20565 · cs.CV · March 25, 2026

Open Access

TL;DR

DINO-Tok is a visual tokenizer built on a frozen DINO encoder. It unifies shallow and deep hierarchical features to balance reconstruction fidelity and semantic consistency in high-dimensional latent spaces for image generation.

Contribution

The paper presents DINO-Tok, a visual tokenizer built on a frozen DINO encoder that supports both continuous autoencoding (DINO-Tok-AE) and discrete vector quantization (DINO-Tok-VQ), and proposes Dominant-Subspace Quantization to improve codebook stability.
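For readers unfamiliar with vector quantization, the sketch below shows the generic nearest-neighbour codebook lookup that discrete tokenizers build on. It is not the paper's Dominant-Subspace Quantization, whose details are not given on this page. The function name `quantize`, the array shapes, and the random data are all assumptions made for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2).

    features: (num_tokens, dim) array, e.g. patch features from an encoder.
    codebook: (codebook_size, dim) array of code vectors.
    Returns the chosen code indices and the quantized vectors.
    """
    # Pairwise squared distances, shape (num_tokens, codebook_size).
    diffs = features[:, None, :] - codebook[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    indices = sq_dists.argmin(axis=1)
    return indices, codebook[indices]


# Hypothetical toy data: 16 codes of dimension 8, and four features that
# are slightly perturbed copies of codes 3, 7, 7 and 0.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))
features = codebook[[3, 7, 7, 0]] + 0.01 * rng.normal(size=(4, 8))
indices, quantized = quantize(features, codebook)
```

In a trained tokenizer the integer `indices` are the discrete tokens passed to the generative model, and `quantized` is what the decoder sees.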

Findings

01. Achieves 0.28 rFID in autoencoding on ImageNet 256×256

02. Attains 1.10 rFID with VQ for high-fidelity reconstruction

03. Demonstrates strong few-step generation performance with 1.82 gFID
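The page does not define rFID and gFID. Assuming the usual convention, both are Fréchet Inception Distances against real images, computed on reconstructions (rFID) or on generated samples (gFID), with lower values being better. The sketch below computes the Fréchet distance between two sets of feature vectors with NumPy only. In practice the features come from an Inception network; here they are arbitrary arrays, and the function name `frechet_distance` is illustrative.

```python
import numpy as np


def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to two feature sets.

    x, y: (num_samples, dim) arrays of feature vectors.
    Computes ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^{1/2}).
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    # The trace of the matrix square root of C_x C_y equals the sum of the
    # square roots of its eigenvalues, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_x - mu_y) ** 2).sum()
    return float(mean_term + np.trace(cov_x) + np.trace(cov_y) - 2.0 * trace_sqrt)


# Toy check: shifting every feature by a constant changes only the mean term.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
same = frechet_distance(real, real)
shifted = frechet_distance(real, real + 2.0)
```

Comparing a set with itself gives a distance of zero up to floating-point error, and a pure shift of the features contributes only the squared norm of the shift.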

Abstract

Recent advances in visual generation have emphasized the importance of Latent Generative Models (LGMs), which critically depend on effective visual tokenizers to bridge pixels and semantic representations. However, tokenizers constructed on pre-trained vision foundation models (VFMs) often struggle to balance semantic richness and reconstruction fidelity in high-dimensional latent spaces. In this paper, we introduce DINO-Tok, a visual tokenizer built upon a frozen DINO encoder that supports both continuous autoencoding (DINO-Tok-AE) and discrete vector-quantization (DINO-Tok-VQ). By unifying hierarchical representations from both shallow fine-grained features and deep global semantics into an information-complete latent space, DINO-Tok preserves texture details while maintaining semantic consistency for generation. We further investigate VQ in frozen semantic feature spaces of…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis