Towards Multi-Task Multi-Modal Models: A Video Generative Perspective
Lijun Yu

TL;DR
This paper presents a comprehensive approach to multi-task, multi-modal video generation and understanding. It introduces novel video tokenizers and scalable models, and shows language models surpassing diffusion models in visual synthesis.
Contribution
It introduces a scalable multi-modal transformer, novel video tokenizers, and demonstrates language models surpassing diffusion models in visual synthesis, advancing multi-task video generation and understanding.
Findings
Language models outperform diffusion models in visual synthesis.
Video tokenizer outperforms industry-standard codecs.
Scalable models enable high-fidelity, multi-modal video generation.
Abstract
Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions, as well as for understanding and compression applications. Given the high dimensionality of visual data, we pursue concise and accurate latent representations. Our video-native spatial-temporal tokenizers preserve high fidelity. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Furthermore, our scalable visual token representation proves beneficial across generation, compression, and understanding tasks. This achievement marks the first instances of language…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Video Analysis and Summarization
Methods: Diffusion
