Composable Visual Tokenizers with Generator-Free Diagnostics of Learnability
Bingchen Zhao, Qiushan Guo, Ye Wang, Yixuan Huang, Zhonghua Zhai, Yu Tian

TL;DR
This paper presents CompTok, a training framework for visual tokenizers that improves the compositionality of the learned tokens. The resulting tokenizer supports high-quality class-conditioned image generation and semantic editing by token swapping, and the paper adds generator-free diagnostics for measuring how learnable the token space is.
Contribution
CompTok trains a tokenizer with a token-conditioned diffusion decoder, a recognition model that predicts the conditioning tokens from the decoded images, and a manifold constraint on swapped-token generations. This improves compositionality and allows learnability to be diagnosed without a generator.
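As a rough illustration of the recognition-model idea, the toy sketch below scores how well tokens recovered from a decoded image match the tokens that conditioned the decoder. It assumes continuous token vectors and a mean-squared-error loss, which is the form an InfoGAN-style bound takes for fixed-variance Gaussian codes; the function name `recognition_loss` and the example values are hypothetical and are not taken from the paper.

```python
def recognition_loss(conditioning_tokens, predicted_tokens):
    """Mean squared error between conditioning tokens and recovered tokens.

    Both arguments are lists of token vectors (lists of floats).
    Assumption: tokens are continuous; the paper's actual loss may differ.
    """
    if len(conditioning_tokens) != len(predicted_tokens):
        raise ValueError("token sequences must have the same length")
    total, count = 0.0, 0
    for cond, pred in zip(conditioning_tokens, predicted_tokens):
        if len(cond) != len(pred):
            raise ValueError("token vectors must have the same dimension")
        for c, p in zip(cond, pred):
            total += (c - p) ** 2
            count += 1
    return total / count


# Hypothetical conditioning tokens: two tokens, each a 2-d vector.
cond = [[1.0, 0.0], [0.5, -0.5]]

# Decoder used every token: the recognition model can recover them all.
loss_used = recognition_loss(cond, [[1.0, 0.0], [0.5, -0.5]])

# Decoder ignored the second token: its recovered value carries no signal.
loss_ignored = recognition_loss(cond, [[1.0, 0.0], [0.0, 0.0]])
```

In this toy setting the loss is zero only when every token can be read back from the decoded image, so a decoder that drops a token is penalized.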
Findings
Achieves state-of-the-art performance on class-conditioned image generation.
Enables semantic editing by swapping tokens between images.
Provides metrics to measure token space compositionality and learnability.
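The token-swapping operation behind the semantic-editing finding can be sketched in a few lines. This is a minimal illustration under stated assumptions: each image is represented by an equal-length sequence of tokens, and the helper names `swap_token_subsets` and `random_swap_indices` are hypothetical, not the authors' code. How the swapped positions are actually chosen in CompTok is not stated in this summary.

```python
import random


def swap_token_subsets(tokens_a, tokens_b, swap_indices):
    """Return copies of two token sequences with the given positions exchanged."""
    if len(tokens_a) != len(tokens_b):
        raise ValueError("token sequences must have the same length")
    out_a, out_b = list(tokens_a), list(tokens_b)
    for i in swap_indices:
        out_a[i], out_b[i] = tokens_b[i], tokens_a[i]
    return out_a, out_b


def random_swap_indices(num_tokens, num_swapped, rng):
    """Pick a random subset of positions to swap (assumed uniform sampling)."""
    return sorted(rng.sample(range(num_tokens), num_swapped))


# Placeholder tokens standing in for the tokenizer outputs of two images.
tokens_a = ["a0", "a1", "a2", "a3"]
tokens_b = ["b0", "b1", "b2", "b3"]

mixed_a, mixed_b = swap_token_subsets(tokens_a, tokens_b, [1, 3])
# mixed_a == ["a0", "b1", "a2", "b3"]
# mixed_b == ["b0", "a1", "b2", "a3"]
```

Decoding `mixed_a` would then yield an edited image with no paired ground-truth target, which is why the swapped case needs a separate constraint to stay on the natural-image distribution.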
Abstract
We introduce CompTok, a training framework for learning visual tokenizers whose tokens are enhanced for compositionality. CompTok uses a token-conditioned diffusion decoder. An InfoGAN-style objective trains a recognition model to predict, from the decoded images, the tokens used to condition the diffusion decoder, which forces the decoder not to ignore any of the tokens. To promote compositional control, CompTok trains not only on the original images but also on token sequences formed by swapping token subsets between images, giving the tokens more compositional control over the decoder. Because the swapped tokens have no ground-truth image targets, we apply a manifold constraint via an adversarial flow regularizer to keep unpaired swap generations on the natural-image distribution. The resulting tokenizer not only achieves state-of-the-art performance on image…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Cell Image Analysis Techniques · Adversarial Robustness in Machine Learning
