MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation
Gabriel Maldonado, Armin Danesh Pazho, Ghazal Alinezhad Noghre, Vinit Katariya, Hamed Tabkhi

TL;DR
MoCLIP is a novel fine-tuned CLIP model with a motion encoding head that improves human motion generation by capturing motion dynamics more effectively, leading to better alignment and fidelity in text-to-motion tasks.
Contribution
The paper introduces MoCLIP, a motion-aware extension of CLIP that adds a motion encoding head and is trained on motion sequences with contrastive learning and a tethering loss, while remaining compatible with existing CLIP-based motion generation pipelines.
Findings
Improves Top-1, Top-2, and Top-3 accuracy in motion tasks.
Maintains competitive FID scores.
Enhances motion fidelity and text-to-motion alignment.
Abstract
Human motion generation is essential for fields such as animation, robotics, and virtual reality, requiring models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but their training on text-image pairs constrains their ability to understand temporal and kinematic structures inherent in motion and motion generation. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines and seamlessly integrating into various CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while…
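The training objective named in the abstract (contrastive learning plus a tethering loss) can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so everything here is an assumption: the contrastive term is the standard symmetric CLIP-style loss over text and motion embeddings, the tethering term is taken to be a cosine-distance penalty that keeps the fine-tuned text embeddings close to those of the frozen original CLIP, and the names `contrastive_loss`, `tethering_loss`, `total_loss`, `temperature`, and `tether_weight` are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(text_emb, motion_emb, temperature=0.07):
    """Symmetric CLIP-style loss: matched text/motion pairs share an index."""
    t = [normalize(v) for v in text_emb]
    m = [normalize(v) for v in motion_emb]
    logits = [[dot(ti, mj) / temperature for mj in m] for ti in t]
    logits_t = [list(col) for col in zip(*logits)]  # motion-to-text direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def tethering_loss(text_emb, frozen_text_emb):
    """Assumed form: mean (1 - cosine similarity) to the frozen CLIP text embeddings."""
    sims = [dot(normalize(a), normalize(b)) for a, b in zip(text_emb, frozen_text_emb)]
    return sum(1.0 - s for s in sims) / len(sims)

def total_loss(text_emb, motion_emb, frozen_text_emb, tether_weight=0.5):
    """Assumed combination: contrastive term plus a weighted tethering term."""
    return (contrastive_loss(text_emb, motion_emb)
            + tether_weight * tethering_loss(text_emb, frozen_text_emb))
```

Under these assumptions, the contrastive term pulls each caption toward its own motion sequence and away from the others in the batch, while the tethering term discourages the text encoder from drifting far from the original CLIP embedding space, which is one plausible way to stay compatible with pipelines built on the original encoder.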
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
