Towards Multi-Task Multi-Modal Models: A Video Generative Perspective

Lijun Yu

arXiv:2405.16728·cs.CV·May 28, 2024

Open Access

TL;DR

This thesis presents an approach to multi-task, multi-modal video generation and understanding. It introduces novel video tokenizers and scalable models, and reports language models surpassing diffusion models in visual synthesis.

Contribution

It introduces a scalable multi-modal transformer and novel video tokenizers, and demonstrates that language models can surpass diffusion models in visual synthesis, advancing multi-task video generation and understanding.

Findings

01 Language models outperform diffusion models in visual synthesis.

02 Video tokenizer outperforms industry-standard codecs.

03 Scalable models enable high-fidelity, multi-modal video generation.

Abstract

Advancements in language foundation models have primarily fueled the recent surge in artificial intelligence. In contrast, generative learning of non-textual modalities, especially videos, significantly trails behind language modeling. This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions, as well as for understanding and compression applications. Given the high dimensionality of visual data, we pursue concise and accurate latent representations. Our video-native spatial-temporal tokenizers preserve high fidelity. We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms. Furthermore, our scalable visual token representation proves beneficial across generation, compression, and understanding tasks. This achievement marks the first instances of language…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Video Analysis and Summarization

Methods: Diffusion