Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin; Zhicheng Sun; Kun Xu; Kun Xu; Liwei Chen; Hao Jiang; Quzhe Huang; Chengru Song; Yuliang Liu; Di Zhang; Yang Song; Kun Gai; Yadong Mu

arXiv:2402.03161 · cs.CV · June 4, 2024 · 5 cites

TL;DR

Video-LaVIT introduces a unified pre-training framework for videos, images, and text by decomposing videos into keyframes and motions, enabling effective multimodal understanding and generation.

Contribution

It proposes a novel tokenization approach that discretizes visual and temporal information for unified video, image, and text pre-training with LLMs.

Findings

01. Achieves competitive results on 13 multimodal benchmarks.

02. Effectively models spatiotemporal dynamics in videos.

03. Enables high-quality video content generation.

Abstract

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of…
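As a rough illustration of the idea described in the abstract (splitting a video into a keyframe plus temporal motions, then discretizing each into tokens), the following is a minimal sketch and not the authors' implementation. It assumes, purely for illustration, that "motion" can be approximated by frame-to-frame differences and that discretization is a nearest-neighbor lookup in a fixed codebook; the abstract does not specify either. The names `decompose_clip`, `quantize`, and `tokenize_clip` are hypothetical, and frames are plain Python lists so the example has no dependencies.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # a flattened grayscale frame; a toy stand-in for real pixels


def decompose_clip(frames: Sequence[Frame]) -> Tuple[Frame, List[Frame]]:
    """Split a clip into one keyframe plus per-step motion residuals.

    Assumption: frame differences stand in for "temporal motions"; the
    abstract does not say how motion is actually represented.
    """
    if not frames:
        raise ValueError("clip must contain at least one frame")
    keyframe = list(frames[0])
    motions = [
        [cur - prev for prev, cur in zip(frames[i - 1], frames[i])]
        for i in range(1, len(frames))
    ]
    return keyframe, motions


def quantize(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the nearest codebook entry (squared L2 distance)."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((v - e) ** 2 for v, e in zip(vector, entry))

    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def tokenize_clip(
    frames: Sequence[Frame],
    visual_codebook: Sequence[Sequence[float]],
    motion_codebook: Sequence[Sequence[float]],
) -> Tuple[int, List[int]]:
    """Map a clip to one visual token and a list of motion tokens."""
    keyframe, motions = decompose_clip(frames)
    visual_token = quantize(keyframe, visual_codebook)
    motion_tokens = [quantize(m, motion_codebook) for m in motions]
    return visual_token, motion_tokens


if __name__ == "__main__":
    # Toy clip: a two-pixel "frame" whose first pixel brightens each step.
    clip = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
    visual_codebook = [[0.0, 0.0], [5.0, 5.0]]               # hypothetical codebook
    motion_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical codebook
    print(tokenize_clip(clip, visual_codebook, motion_codebook))  # (0, [1, 1])
```

In this toy setup the discrete indices could be interleaved with text tokens in a single sequence for an LLM, which is the sense in which the abstract speaks of unified generative pre-training. A real system would learn the codebooks and use a far richer motion representation.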

Code & Models

Repositories

jy0205/lavit (PyTorch)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging