CoMo: Compositional Motion Customization for Text-to-Video Generation
Youcan Xu, Zhen Wang, Jiaxin Shi, Kexin Li, Feifei Shao, Jun Xiao, Yi Yang, Jun Yu, Long Chen

TL;DR
CoMo is a framework for compositional motion customization in text-to-video generation. It synthesizes videos containing multiple distinct motions through a two-phase approach: it first learns each motion disentangled from appearance, then composes the learned motions with a plug-and-play strategy.
Contribution
The paper presents CoMo, a method for multi-motion synthesis in text-to-video generation that disentangles and composes motions, addressing the motion-appearance entanglement and ineffective multi-motion blending that cause single-motion customization methods to fail in compositional scenarios.
Findings
Achieves state-of-the-art performance in multi-motion video synthesis.
Introduces a new benchmark and evaluation metric for multi-motion fidelity.
Demonstrates effective disentanglement and blending of multiple motions.
Abstract
While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for compositional motion customization in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy…
