TL;DR
MotionCLIP is a 3D human motion auto-encoder whose latent space is aligned with that of CLIP, enabling text-driven motion generation (including out-of-domain actions) and motion editing from textual descriptions.
Contribution
Introduces a transformer-based motion auto-encoder whose latent space is aligned with CLIP space, enabling semantic, out-of-domain, and disentangled motion generation from text prompts.
Findings
Enables text-to-motion generation for unseen actions
Supports motion editing and interpolation in the semantic latent space
Achieves high-quality, semantically meaningful motion synthesis
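To make the interpolation finding concrete, here is a minimal sketch of linear interpolation between two latent codes. It assumes latents are plain vectors and that each intermediate code would be passed to the motion decoder; the function names `lerp_latents` and `interpolation_path` are hypothetical and not taken from the paper or its code.

```python
def lerp_latents(z_a, z_b, t):
    """Linearly blend two latent vectors; t=0 gives z_a, t=1 gives z_b."""
    if len(z_a) != len(z_b):
        raise ValueError("latent vectors must have the same length")
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """Return `steps` evenly spaced latents from z_a to z_b (inclusive)."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp_latents(z_a, z_b, i / (steps - 1)) for i in range(steps)]


# Hypothetical usage: each latent in the path would be decoded into a motion
# by the (not shown) decoder of the auto-encoder.
path = interpolation_path([0.0, 0.0], [2.0, 4.0], steps=3)
```

Linear blending is only one possible choice; whether the paper uses this exact scheme is not stated in the summary above.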
Abstract
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP-space. We further leverage CLIP's unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in…
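The training objective described in the abstract (reconstruct the motion while pulling its latent toward the CLIP embedding of the text label and of a rendered frame) can be sketched as a weighted sum of three terms. This is an illustrative sketch only: the use of mean squared error and cosine distance, the weights `w_text` and `w_image`, and all function names are assumptions, not details confirmed by the text above.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_squared_error(a, b):
    """Average squared difference between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def alignment_loss_sketch(motion, reconstruction, z_motion, z_text, z_image,
                          w_text=1.0, w_image=1.0):
    """Reconstruction term plus text and image alignment terms.

    motion, reconstruction: flattened motion features (assumed representation).
    z_motion: latent produced by the motion encoder.
    z_text, z_image: CLIP embeddings of the text label and a rendered frame,
    assumed to be precomputed with a frozen CLIP model.
    """
    recon_term = mean_squared_error(motion, reconstruction)
    text_term = cosine_distance(z_motion, z_text)
    image_term = cosine_distance(z_motion, z_image)
    return recon_term + w_text * text_term + w_image * image_term
```

With a perfect reconstruction and a latent that points in the same direction as both CLIP embeddings, every term vanishes and the loss is zero; any misalignment raises the corresponding cosine term.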
Taxonomy
Methods: Contrastive Language-Image Pre-training
