Text Prompting for Multi-Concept Video Customization by Autoregressive Generation
Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

TL;DR
This paper introduces a novel autoregressive text prompting method for multi-concept video customization using pretrained text-to-video models, enabling complex scene generation with multiple subjects, actions, and backgrounds.
Contribution
It proposes a sequential, controlled approach to generate videos with multiple concepts by navigating the intersection of video manifolds through text prompts.
Findings
Successfully generates videos with multiple complex concepts.
Quantitative evaluation shows high scores in video quality and relevance.
Human evaluation confirms the effectiveness of the method.
Abstract
We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text prompting, leads to the solution. To do so, we generate the various concepts and their corresponding interactions, sequentially, in an autoregressive manner. Our method can generate videos of multiple custom concepts (subjects, action and background) such as a teddy bear running towards a brown teapot, a dog playing violin and a teddy bear swimming in the ocean. We quantitatively evaluate our method using videoCLIP and DINO scores, in addition to human evaluation. Videos for results presented…
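The abstract names DINO scores as one of its metrics but does not say how they are computed. A common convention for DINO-based fidelity scores is the average cosine similarity between embeddings of reference images and embeddings of generated frames. The sketch below assumes that convention and assumes the embeddings have already been extracted by some feature extractor (not shown). The function names are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def mean_frame_similarity(reference_embeddings, frame_embeddings):
    """Average cosine similarity over all (reference, frame) pairs.

    Assumption: both arguments are non-empty lists of precomputed
    embedding vectors (for example, from a DINO image encoder).
    """
    if not reference_embeddings or not frame_embeddings:
        raise ValueError("need at least one reference and one frame embedding")
    scores = [
        cosine_similarity(ref, frame)
        for ref in reference_embeddings
        for frame in frame_embeddings
    ]
    return sum(scores) / len(scores)
```

Under this convention a higher value suggests the generated frames stay closer to the reference concept in embedding space. The paper's exact protocol may differ.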
Taxonomy
Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimedia Communication and Technology
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Residual Connection · Multi-Head Attention · Dense Connections · Vision Transformer · self-DIstillation with NO labels
