Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

Divya Kothandaraman; Kihyuk Sohn; Ruben Villegas; Paul Voigtlaender; Dinesh Manocha; Mohammad Babaeizadeh

arXiv:2405.13951·cs.CV·May 24, 2024

Open Access

TL;DR

This paper introduces a novel autoregressive text prompting method for multi-concept video customization using pretrained text-to-video models, enabling complex scene generation with multiple subjects, actions, and backgrounds.

Contribution

It proposes a sequential, controlled approach to generate videos with multiple concepts by navigating the intersection of video manifolds through text prompts.

Findings

01. Generates videos that combine multiple custom concepts (subjects, actions, and backgrounds).

02. Quantitative evaluation with videoCLIP and DINO scores indicates high video quality and relevance.

03. Human evaluation confirms the effectiveness of the method.

Abstract

We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented…
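
The sequential, text-directed generation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how each step is conditioned on the previous one, so the function `generate_autoregressively`, the `generate_clip` callback, the `context_frames` parameter, and the toy generator are all hypothetical names introduced here. The sketch assumes each new clip is conditioned on the last few frames of the preceding clip, and it uses strings as stand-ins for video frames.

```python
from typing import Callable, List, Optional, Sequence

Frames = List[str]  # stand-in for real video frames (assumption for illustration)


def generate_autoregressively(
    prompts: Sequence[str],
    generate_clip: Callable[[str, Optional[Frames]], Frames],
    context_frames: int = 2,
) -> Frames:
    """Generate one clip per prompt, feeding the tail of each clip to the next call.

    `generate_clip` is a hypothetical wrapper around a pretrained T2V model;
    its conditioning mechanism is an assumption, not taken from the paper.
    """
    video: Frames = []
    context: Optional[Frames] = None  # no conditioning for the first prompt
    for prompt in prompts:
        clip = generate_clip(prompt, context)
        video.extend(clip)
        # Keep only the last few frames as conditioning for the next step.
        context = clip[-context_frames:]
    return video


def toy_generate_clip(prompt: str, context: Optional[Frames]) -> Frames:
    """Toy generator: labels frames by whether they were conditioned on a prior clip."""
    tag = "init" if context is None else "cond"
    return [f"{tag}:{prompt}:{i}" for i in range(3)]


if __name__ == "__main__":
    prompts = [
        "a teddy bear",
        "a teddy bear running towards a brown teapot",
    ]
    for frame in generate_autoregressively(prompts, toy_generate_clip):
        print(frame)
```

In this sketch the prompt sequence introduces one concept first and the interaction second, which mirrors the idea of approaching the multi-concept video step by step instead of requesting everything in a single prompt.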

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimedia Communication and Technology

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels