NativeTok: Native Visual Tokenization for Improved Image Generation
Bin Wu, Mengqi Huang, Weinan Jia, Zhendong Mao

TL;DR
NativeTok is a visual tokenization method that enforces causal dependencies among tokens during tokenization. By embedding relational constraints directly into token sequences, it aims to improve the coherence of generated images.
Contribution
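The paper proposes native visual tokenization, which enforces causal dependencies during tokenization, and NativeTok, a framework that combines a Meta Image Transformer (MIT) with a Mixture of Causal Expert Transformer (MoCET) for efficient, constrained image tokenization and generation.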
Findings
Enhanced image reconstruction quality.
Better coherence in generated images.
Efficient training with Hierarchical Native Training.
Abstract
VQ-based image generation typically follows a two-stage pipeline: a tokenizer encodes images into discrete tokens, and a generative model learns their dependencies for reconstruction. However, improved tokenization in the first stage does not necessarily enhance the second-stage generation, as existing methods fail to constrain token dependencies. This mismatch forces the generative model to learn from unordered distributions, leading to bias and weak coherence. To address this, we propose native visual tokenization, which enforces causal dependencies during tokenization. Building on this idea, we introduce NativeTok, a framework that achieves efficient reconstruction while embedding relational constraints within token sequences. NativeTok consists of: (1) a Meta Image Transformer (MIT) for latent image modeling, and (2) a Mixture of Causal Expert Transformer (MoCET), where each…
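As a rough illustration of the two ideas the abstract relies on, the sketch below shows (1) generic vector quantization, where each latent vector is replaced by the index of its nearest codebook entry, and (2) a causal mask, where position i may only depend on positions up to i. This is not the NativeTok implementation: the function names `quantize` and `causal_mask`, the toy 2-D codebook, and the latent values are all assumptions made up for illustration.

```python
# Illustrative sketch only; names and values are hypothetical, not from the paper.

def quantize(vectors, codebook):
    """Map each latent vector to the index of its nearest codebook entry
    (squared Euclidean distance), as in a generic VQ tokenizer."""
    tokens = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def causal_mask(n):
    """Return an n x n boolean matrix where mask[i][j] is True
    when position i is allowed to depend on position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


# Toy codebook with four 2-D entries (assumed values).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Toy encoder outputs (assumed values).
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]

tokens = quantize(latents, codebook)   # discrete token ids
mask = causal_mask(len(tokens))        # lower-triangular dependency pattern

print(tokens)
for row in mask:
    print(row)
```

In a conventional pipeline only the first step happens in the tokenizer, and the ordering constraint is left to the generative model. The abstract's claim is that imposing the causal structure during tokenization itself gives the second stage an ordered sequence to learn from.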
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computer Graphics and Visualization Techniques · Cell Image Analysis Techniques
