Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models

Huijie Liu; Jingyun Wang; Shuai Ma; Jie Hu; Xiaoming Wei; Guoliang Kang

arXiv:2501.16714 · cs.CV · September 9, 2025

TL;DR

This paper introduces novel strategies to improve motion-appearance separation in text-to-video diffusion models, enabling more accurate motion customization without sacrificing appearance diversity.

Contribution

It proposes two new techniques, temporal attention purification and appearance highway, to better disentangle motion from appearance in diffusion model adaptation.

Findings

1. Enhanced motion-appearance separation in generated videos.

2. Improved alignment of generated video appearance with text descriptions.

3. Generated motion that is more consistent with the reference videos.

Abstract

Motion customization aims to adapt a diffusion model (DM) to generate videos with the motion specified by a set of video clips that share the same motion concept. To achieve this, the adapted DM must model the specified motion concept without compromising its ability to generate diverse appearances. The key to solving this problem therefore lies in how to separate the motion concept from the appearance during the adaptation of the DM. Previous works typically explore different ways to represent and insert a motion concept into large-scale pretrained text-to-video diffusion models, e.g., learning a motion LoRA, using latent noise residuals, etc. While those methods can encode the motion concept, they also inevitably encode the appearance in the reference videos, resulting in weakened appearance generation capability. In this paper, we follow the typical way to learn a…
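For readers unfamiliar with the LoRA adaptation mentioned in the abstract, the following is a minimal sketch of a generic low-rank adapter on a single linear layer, written in plain Python. It is not the paper's method: the class name `LoRALinear`, the helper `matvec`, and the `rank`, `alpha`, and `seed` parameters are assumptions made for illustration, and the paper's temporal attention purification and appearance highway are not shown because their details are not given in this excerpt.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration of generic LoRA, not the paper's implementation.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, shape (out_dim, in_dim), kept frozen
        self.scale = alpha / rank
        # A starts with small random values and B starts at zero, so the
        # adapted layer initially reproduces the pretrained layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(out_dim)]

    def forward(self, x):
        base = matvec(self.weight, x)               # W x
        delta = matvec(self.B, matvec(self.A, x))   # B (A x), rank-limited update
        return [b + self.scale * d for b, d in zip(base, delta)]
```

In a motion-customization setting, only `A` and `B` would be trained on the reference clips while `weight` stays fixed. Because the update is learned from those clips, it can absorb their appearance as well as their motion, which is the entanglement the abstract describes.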

Taxonomy

Topics: Categorization, perception, and language · Face recognition and analysis

Methods: Softmax · Attention Is All You Need · Max Pooling · Convolution · Concatenated Skip Connection · U-Net · Sparse Evolutionary Training · Diffusion