OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation
Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang

TL;DR
OmniTokenizer introduces a unified transformer-based tokenizer capable of jointly encoding images and videos, enabling improved reconstruction and synthesis performance across diverse visual datasets.
Contribution
It is the first to unify image and video tokenization in a single framework with a spatial-temporal decoupled architecture and progressive training strategy.
Findings
Achieves state-of-the-art reconstruction FID of 1.11 on ImageNet.
Attains 42 FVD on UCF-101, surpassing previous methods by 26%.
Enhances visual synthesis when integrated with language and diffusion models.
Abstract
Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a…
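As a rough illustration of the causal attention that the abstract names for temporal modeling, the sketch below applies attention along the time axis so that each frame can only see itself and earlier frames. It is a minimal NumPy sketch, not the authors' implementation: the function name `causal_temporal_attention`, the (T, N, D) tensor layout, and the use of identity query/key/value projections are assumptions made for brevity, and the window attention used for spatial modeling is not shown.

```python
import numpy as np

def causal_temporal_attention(x):
    """Hypothetical helper: single-head causal attention over the time axis.

    x has shape (T, N, D): T frames, N spatial tokens per frame, D channels.
    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    T, N, D = x.shape
    # Put the spatial axis first so each spatial location attends over time independently.
    q = k = v = np.transpose(x, (1, 0, 2))                 # (N, T, D)
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(D)   # (N, T, T)
    # True above the diagonal marks future frames, which must not be attended to.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key (time) axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                      # (N, T, D)
    return np.transpose(out, (1, 0, 2)), weights

# Usage: 4 frames, 3 spatial tokens per frame, 8 channels.
x = np.random.default_rng(0).normal(size=(4, 3, 8))
out, weights = causal_temporal_attention(x)
```

With this mask the first frame attends only to itself, so a single image can be treated as a one-frame video, which is one plausible reading of how a causal temporal design accommodates both input types.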
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Video Analysis and Summarization
Methods: Diffusion
