DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Hong Chen; Xin Wang; Yipeng Zhang; Yuwei Zhou; Zeyang Zhang; Siao Tang; Wenwu Zhu

arXiv:2405.12796·cs.CV·May 22, 2024

TL;DR

DisenStudio is a novel framework that enables multi-subject, text-guided video generation with disentangled spatial control, addressing previous issues of subject-missing, attribute-binding, and action-binding in multi-subject scenarios.

Contribution

The paper introduces DisenStudio, a diffusion-based model with spatial-disentangled cross-attention and specialized finetuning strategies for effective multi-subject text-to-video generation.

Findings

01. Outperforms existing methods in multiple metrics.

02. Successfully generates videos with multiple subjects and desired actions.

03. Demonstrates versatility in controllable generation applications.

Abstract

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem) and so fail to achieve satisfactory multi-subject generation performance. To tackle these problems, we propose DisenStudio, a novel framework that generates text-guided videos for multiple customized subjects, given a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model…
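
The abstract names a spatial-disentangled cross-attention mechanism but is cut off before describing it. The sketch below is therefore not the paper's implementation. It only illustrates one generic way to keep subjects spatially separate in cross-attention, under the assumption that each subject comes with a binary region mask over the frame's spatial locations and its own text tokens (for example, the subject plus its action). All function and parameter names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Plain single-head cross-attention: one output vector per query."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def region_masked_cross_attention(queries, global_text, subject_texts, subject_masks):
    """Assumed scheme, not taken from the paper.

    queries:       one feature vector per spatial location of a frame
    global_text:   (keys, values) for the full prompt
    subject_texts: one (keys, values) pair per subject
    subject_masks: one 0/1 list per subject, marking its region
    Locations inside a subject's region attend only to that subject's
    tokens; all remaining locations attend to the full prompt.
    """
    output = cross_attention(queries, *global_text)
    for (keys, values), mask in zip(subject_texts, subject_masks):
        subject_output = cross_attention(queries, keys, values)
        for location, inside in enumerate(mask):
            if inside:
                output[location] = subject_output[location]
    return output


# Toy frame with four spatial locations and two subjects.
frame_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
full_prompt = ([[1.0, 0.0]], [[0.5, 0.5]])
subject_a = ([[0.0, 1.0]], [[1.0, 0.0]])
subject_b = ([[1.0, 1.0]], [[0.0, 1.0]])
masks = [[1, 1, 0, 0], [0, 0, 1, 0]]
print(region_masked_cross_attention(frame_queries, full_prompt, [subject_a, subject_b], masks))
```

In this toy setup every text source has a single token, so each location simply receives the value vector of the source it is routed to. That makes the routing visible: the first two locations follow subject A, the third follows subject B, and the fourth falls back to the full prompt.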

Taxonomy

Topics: Human Motion and Animation · Video Analysis and Summarization · Multimedia Communication and Technology

MethodsFocus