Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Yuying Ge; Yizhuo Li; Yixiao Ge; Ying Shan

arXiv:2412.04432 · cs.CV · December 6, 2024



TL;DR

Divot introduces a diffusion-powered video tokenizer that captures spatial and temporal features for improved video comprehension and generation within large language models, enabling realistic video synthesis and understanding.

Contribution

It presents the first diffusion-based video tokenizer that effectively encodes and decodes videos for LLM integration, advancing video understanding and generation capabilities.

Findings

1. Achieves competitive performance on video benchmarks.
2. Enables high-quality text-to-video generation.
3. Demonstrates effective video storytelling with Divot-Vicuna.

Abstract

In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and the representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal…
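The premise stated in the abstract (a tokenizer is good if a diffusion model can de-noise clips when conditioned on its features) can be illustrated with a toy sketch. This is not the Divot implementation: the flattened 16-value "clip", the `toy_denoiser` function, and the two hand-made feature vectors are all hypothetical stand-ins chosen so the example runs with only the standard library. It shows only that a standard noise-prediction loss is low when the condition preserves the clip and higher when the condition carries no information.

```python
import math
import random


def add_noise(x0, eps, alpha_bar):
    """Forward diffusion step: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def toy_denoiser(x_t, alpha_bar, cond):
    """Hypothetical stand-in for a video diffusion model (assumption, not Divot's network).

    It treats the condition as its guess of the clean clip and solves the
    forward equation for the noise.
    """
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - a * c) / b for xt, c in zip(x_t, cond)]


def denoising_loss(x0, cond, alpha_bar, rng):
    """Mean squared error between the true noise and the predicted noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = add_noise(x0, eps, alpha_bar)
    eps_hat = toy_denoiser(x_t, alpha_bar, cond)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat)) / len(x0)


rng = random.Random(0)
clip = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # flattened toy "video"
informative = list(clip)            # features that fully preserve the clip
uninformative = [0.0] * len(clip)   # features that carry no information

loss_good = denoising_loss(clip, informative, 0.5, rng)
loss_bad = denoising_loss(clip, uninformative, 0.5, rng)
print(f"informative condition:   {loss_good:.6f}")
print(f"uninformative condition: {loss_bad:.6f}")
```

In a real system both the tokenizer and the denoiser would be learned networks, and the loss would be used as a training signal rather than computed in closed form as it is here.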


Code & Models

Repositories

tencentarc/divot
PyTorch · Official

Models

TencentARC/Divot (Hugging Face model · 1 download · 7 likes)


Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Diffusion