CoMo: Compositional Motion Customization for Text-to-Video Generation

Youcan Xu; Zhen Wang; Jiaxin Shi; Kexin Li; Feifei Shao; Jun Xiao; Yi Yang; Jun Yu; Long Chen

arXiv:2510.23007·cs.CV·October 28, 2025

TL;DR

CoMo is a framework for compositional motion customization in text-to-video generation. It synthesizes videos containing multiple distinct motions through a two-phase approach: a single-motion learning phase that disentangles each motion from appearance, followed by a plug-and-play multi-motion composition phase.

Contribution

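The paper presents CoMo, a method for multi-motion synthesis in text-to-video generation that disentangles motion from appearance and then composes the learned motions. It targets the two challenges that cause single-motion customization methods to fail in compositional scenarios: motion-appearance entanglement and ineffective multi-motion blending.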

Findings

01. Achieves state-of-the-art performance in multi-motion video synthesis.
02. Introduces a new benchmark and evaluation metric for multi-motion fidelity.
03. Demonstrates effective disentanglement and blending of multiple motions.

Abstract

While recent text-to-video models excel at generating diverse scenes, they struggle with precise motion control, particularly for complex, multi-subject motions. Although methods for single-motion customization have been developed to address this gap, they fail in compositional scenarios due to two primary challenges: motion-appearance entanglement and ineffective multi-motion blending. This paper introduces CoMo, a novel framework for compositional motion customization in text-to-video generation, enabling the synthesis of multiple, distinct motions within a single video. CoMo addresses these issues through a two-phase approach. First, in the single-motion learning phase, a static-dynamic decoupled tuning paradigm disentangles motion from appearance to learn a motion-specific module. Second, in the multi-motion composition phase, a plug-and-play divide-and-merge strategy…
