OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Junke Wang; Yi Jiang; Zehuan Yuan; Binyue Peng; Zuxuan Wu; Yu-Gang Jiang

arXiv:2406.09399·cs.CV·June 14, 2024

TL;DR

OmniTokenizer introduces a unified transformer-based tokenizer capable of jointly encoding images and videos, enabling improved reconstruction and synthesis performance across diverse visual datasets.

Contribution

OmniTokenizer is the first tokenizer to unify image and video tokenization in a single framework, combining a spatial-temporal decoupled architecture with a progressive training strategy.

Findings

01 Achieves state-of-the-art reconstruction FID of 1.11 on ImageNet.

02 Attains 42 FVD on UCF-101, surpassing previous methods by 26%.

03 Enhances visual synthesis when integrated with language and diffusion models.

Abstract

Tokenizer, serving as a translator to map the intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window and causal attention for spatial and temporal modeling. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a…
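The decoupled attention pattern named in the abstract (window attention over space, causal attention over time) can be illustrated with two boolean masks. This is only a minimal sketch, not the authors' implementation: the function names `causal_mask` and `window_mask` are hypothetical, and the use of non-overlapping square windows is an assumption, since the abstract does not specify the window layout.

```python
def causal_mask(num_frames):
    """Temporal mask: entry [i][j] is True when frame i may attend to frame j.

    Causal attention lets each frame see itself and earlier frames only.
    """
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def window_mask(height, width, window):
    """Spatial mask over a height x width token grid, flattened row by row.

    Assumption (not from the paper): windows are non-overlapping
    window x window blocks, and a token attends only inside its own block.
    """
    def block(idx):
        row, col = divmod(idx, width)
        return (row // window, col // window)

    n = height * width
    return [[block(i) == block(j) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    # Three frames: frame 0 sees only itself, frame 2 sees all three.
    for row in causal_mask(3):
        print(["x" if allowed else "." for allowed in row])

    # A 2 x 4 grid with 2 x 2 windows splits into a left and a right block.
    spatial = window_mask(2, 4, 2)
    print(sum(spatial[0]))  # number of tokens that token 0 may attend to
```

Under this sketch, a single image is simply the one-frame case: the temporal mask collapses to a single entry and only the spatial mask matters, which is one way to read how a shared tokenizer could accept both input types.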

Code & Models

Repositories

foundationvision/omnitokenizer (PyTorch, official)

Videos

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Video Analysis and Summarization

Methods: Diffusion