DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control
Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, Wenwu Zhu

TL;DR
DisenStudio is a novel framework for multi-subject, text-guided video generation with disentangled spatial control. It targets the subject-missing, attribute-binding, and action-binding problems that earlier methods show in multi-subject scenarios.
Contribution
The paper introduces DisenStudio, a diffusion-based model with spatial-disentangled cross-attention and specialized finetuning strategies for effective multi-subject text-to-video generation.
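As a rough illustration of what spatially restricted cross-attention can look like, the sketch below masks each subject's text tokens so that they are only visible to spatial locations inside that subject's bounding box. This is a generic region-masking scheme written for this summary, not the paper's formulation: the function name, the box format, and the rule that shared tokens are visible everywhere are all assumptions.

```python
import math

def region_masked_cross_attention(queries, positions, keys, values,
                                  token_subject, subject_boxes):
    """Hypothetical sketch: cross-attention with per-subject spatial masks.

    queries       : one vector per spatial location
    positions     : (x, y) grid coordinate of each query
    keys, values  : one vector per text token
    token_subject : subject id per token, or None for shared tokens
    subject_boxes : subject id -> (x0, y0, x1, y1), inclusive (assumed format)
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q, (x, y) in zip(queries, positions):
        # Scaled dot-product logits; None marks a token hidden at this location.
        logits = []
        for k, subj in zip(keys, token_subject):
            visible = True
            if subj is not None:
                x0, y0, x1, y1 = subject_boxes[subj]
                visible = x0 <= x <= x1 and y0 <= y <= y1
            dot = sum(a * b for a, b in zip(q, k))
            logits.append(scale * dot if visible else None)

        # Softmax over visible tokens only; hidden tokens get weight 0.
        shown = [l for l in logits if l is not None]
        if shown:
            m = max(shown)
            exps = [0.0 if l is None else math.exp(l - m) for l in logits]
            total = sum(exps)
            weights = [e / total for e in exps]
        else:
            weights = [0.0] * len(logits)

        # Attention output is the weighted sum of the value vectors.
        dim = len(values[0])
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: one shared token, one "dog" token (subject 0), one "cat" token (subject 1).
queries = [[1.0, 0.0], [0.0, 1.0]]
positions = [(0, 0), (3, 0)]          # first location in the dog box, second in the cat box
keys = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
token_subject = [None, 0, 1]
subject_boxes = {0: (0, 0, 1, 1), 1: (2, 0, 3, 1)}

outputs, weights = region_masked_cross_attention(
    queries, positions, keys, values, token_subject, subject_boxes)
```

In this toy setup the location inside the dog box gives zero weight to the cat token and vice versa, which is the kind of separation that helps keep one subject's attributes and actions from leaking onto another.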
Findings
Outperforms existing methods in multiple metrics.
Successfully generates videos with multiple subjects and desired actions.
Demonstrates versatility in controllable generation applications.
Abstract
Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject, suffering from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (action-binding problem), failing to achieve satisfactory multi-subject generation performance. To tackle these problems, in this paper, we propose DisenStudio, a novel framework that can generate text-guided videos for customized multiple subjects, given a few images for each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism to associate each subject with the desired action. Then the model…
Taxonomy
Topics: Human Motion and Animation · Video Analysis and Summarization · Multimedia Communication and Technology
