MoCLIP: Motion-Aware Fine-Tuning and Distillation of CLIP for Human Motion Generation

Gabriel Maldonado; Armin Danesh Pazho; Ghazal Alinezhad Noghre; Vinit Katariya; Hamed Tabkhi

arXiv:2505.10810 · cs.CV · May 19, 2025

Open Access

TL;DR

MoCLIP is a novel fine-tuned CLIP model with a motion encoding head that improves human motion generation by capturing motion dynamics more effectively, leading to better alignment and fidelity in text-to-motion tasks.

Contribution

The paper introduces MoCLIP, a motion-aware extension of CLIP that adds a motion encoding head and is trained contrastively on motion sequences, improving text-to-motion generation.

Findings

01 Improves Top-1, Top-2, and Top-3 accuracy on text-to-motion tasks.

02 Maintains competitive FID scores.

03 Enhances motion fidelity and text-to-motion alignment.
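
To make the Top-1, Top-2, and Top-3 figures concrete, the sketch below computes a top-k retrieval accuracy from a text-to-motion similarity matrix. It assumes the metric is a retrieval-style accuracy in which row i holds the similarities of text i to every candidate motion and the matching motion sits at column i; the function name `top_k_accuracy` and the toy matrix are illustrative and are not taken from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of texts whose matching motion ranks within the top k.

    ASSUMPTION: similarity[i][j] scores text i against motion j, and the
    ground-truth match for text i is motion i (the diagonal).
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix (made-up values for illustration only).
toy_similarity = [
    [0.9, 0.1, 0.0],  # correct motion ranked 1st
    [0.8, 0.5, 0.1],  # correct motion ranked 2nd
    [0.3, 0.2, 0.1],  # correct motion ranked 3rd
]

top1 = top_k_accuracy(toy_similarity, 1)
top2 = top_k_accuracy(toy_similarity, 2)
top3 = top_k_accuracy(toy_similarity, 3)
```

By construction the accuracy can only stay the same or rise as k grows, which is why the three values are usually reported together.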

Abstract

Human motion generation is essential for fields such as animation, robotics, and virtual reality, and it requires models that effectively capture motion dynamics from text descriptions. Existing approaches often rely on Contrastive Language-Image Pretraining (CLIP)-based text encoders, but because these encoders are trained on text-image pairs, they are limited in their ability to understand the temporal and kinematic structures inherent in motion. This work introduces MoCLIP, a fine-tuned CLIP model with an additional motion encoding head, trained on motion sequences using contrastive learning and a tethering loss. By explicitly incorporating motion-aware representations, MoCLIP enhances motion fidelity while remaining compatible with existing CLIP-based pipelines, so it can be integrated into a range of CLIP-based methods. Experiments demonstrate that MoCLIP improves Top-1, Top-2, and Top-3 accuracy while…
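
The training objective named in the abstract, contrastive learning combined with a tethering loss, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the contrastive term is the standard symmetric CLIP-style cross-entropy over a text-motion similarity matrix, while the tethering term is assumed here to be a mean squared distance between the fine-tuned text embeddings and those of the frozen original CLIP text encoder. The weight `lam`, the temperature value, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    # Guard against a zero vector so the division is always defined.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_logits(text_embs, motion_embs, temperature):
    # Cosine similarity between every text and every motion, scaled by 1/temperature.
    t = [l2_normalize(v) for v in text_embs]
    m = [l2_normalize(v) for v in motion_embs]
    return [[sum(a * b for a, b in zip(ti, mj)) / temperature for mj in m] for ti in t]


def diagonal_cross_entropy(logits):
    # Mean of -log softmax(row)[i]: each row's correct class is its own index.
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)
        log_z = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def contrastive_loss(text_embs, motion_embs, temperature=0.07):
    # Symmetric CLIP-style loss: text-to-motion plus motion-to-text.
    logits = similarity_logits(text_embs, motion_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


def tethering_loss(tuned_text_embs, frozen_text_embs):
    # ASSUMPTION: tethering is modeled as a mean squared distance that keeps the
    # fine-tuned text embeddings close to the frozen original CLIP embeddings.
    total, count = 0.0, 0
    for a, b in zip(tuned_text_embs, frozen_text_embs):
        for x, y in zip(a, b):
            total += (x - y) ** 2
            count += 1
    return total / count


def total_loss(text_embs, motion_embs, frozen_text_embs, lam=0.5):
    # lam is a hypothetical weighting between the two terms.
    return contrastive_loss(text_embs, motion_embs) + lam * tethering_loss(
        text_embs, frozen_text_embs
    )


# Toy embeddings (made-up values): aligned pairs versus swapped pairs.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(text, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(text, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the aligned pairs give a much smaller contrastive loss than the swapped pairs, and the tethering term is zero whenever the fine-tuned text embeddings equal the frozen ones, which is the sense in which such a term would keep the encoder compatible with existing CLIP-based pipelines.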

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Contrastive Learning · Contrastive Language-Image Pre-training