VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models

Hong Chen; Xin Wang; Guanning Zeng; Yipeng Zhang; Yuwei Zhou; Feilin Han; Yaofei Wu; Wenwu Zhu

arXiv:2311.00990 · cs.CV · April 15, 2025 · 6 cites

Open Access

TL;DR

VideoDreamer introduces a novel framework for multi-subject, text-guided video generation that maintains temporal consistency and visual fidelity of multiple subjects, addressing a significant gap in personalized video synthesis.

Contribution

The paper presents Disen-Mix Finetuning and a Human-in-the-Loop Re-finetuning strategy for multi-subject customization, along with a disentangled motion approach, advancing personalized multi-subject video generation.

Findings

1. Effective multi-subject video generation with preserved visual features.
2. Ability to generate videos with new content, events, and backgrounds.
3. Introduction of the MultiStudioBench benchmark for evaluation.

Abstract

Customized text-to-video generation, which aims to generate text-guided videos featuring user-given subjects, has gained increasing attention. However, existing works are primarily limited to single-subject text-to-video generation, leaving the more challenging problem of customized multi-subject generation unexplored. In this paper, we fill this gap and propose a novel VideoDreamer framework, which can generate temporally consistent text-guided videos that faithfully preserve the visual features of multiple given subjects. Specifically, VideoDreamer adopts the pretrained Stable Diffusion with temporal modules as its base video generator, leveraging the power of the text-to-image model to generate diversified content. The video generator is further customized for multiple subjects using the proposed Disen-Mix Finetuning and Human-in-the-Loop Re-finetuning strategy, to tackle…

Taxonomy

Topics: Advanced Vision and Imaging · Human Motion and Animation · Video Analysis and Summarization

Methods: Diffusion · Balanced Selection