LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning

Zhe Li; Weihao Yuan; Yisheng He; Lingteng Qiu; Shenhao Zhu; Xiaodong Gu; Weichao Shen; Yuan Dong; Zilong Dong; Laurence T. Yang

arXiv:2410.07093 · cs.CV · March 11, 2025

TL;DR

LaMP introduces a novel language-motion pretraining framework that significantly improves motion generation, retrieval, and captioning by creating a more aligned and informative language-motion latent space, surpassing previous CLIP-based methods.

Contribution

This work presents LaMP, a new pretraining model that transitions from static image-text embeddings to a dynamic language-motion space, enhancing task performance in motion-related applications.

Findings

01 Substantial improvements in motion generation, retrieval, and captioning tasks.

02 Introduction of LaMP-BertScore for better evaluation of motion-text alignment.

03 Motion-informative text embeddings improve the relevance and semantics of generated motion sequences.

Abstract

Language plays a vital role in the realm of human motion. Existing methods have largely depended on CLIP text embeddings for motion generation, yet they fall short in effectively aligning language and motion due to CLIP's pretraining on static image-text pairs. This work introduces LaMP, a novel Language-Motion Pretraining model, which transitions from a language-vision to a more suitable language-motion latent space. It addresses key limitations by generating motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences. With LaMP, we advance three key tasks: text-to-motion generation, motion-text retrieval, and motion captioning through aligned language-motion representation learning. For generation, we utilize LaMP to provide the text condition instead of CLIP, and an autoregressive masked prediction is designed to achieve mask…
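
As a rough illustration of what aligning text and motion in a shared latent space can involve, the sketch below computes a CLIP-style symmetric contrastive loss over a batch of paired text and motion embeddings. This is an assumption made for illustration, not LaMP's stated objective: the abstract does not specify the pretraining loss, and the function name `symmetric_contrastive_loss`, the temperature value, and the embedding shapes are all hypothetical.

```python
import numpy as np


def symmetric_contrastive_loss(text_emb, motion_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss (illustrative, hypothetical helper).

    text_emb, motion_emb: arrays of shape (batch, dim); row i of each is a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = t @ m.T / temperature

    def cross_entropy_on_diagonal(lg):
        # Row-wise log-softmax with the row maximum subtracted for stability.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        # The correct match for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the text-to-motion and motion-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

Under this kind of objective, a batch whose text and motion embeddings agree pair by pair yields a lower loss than the same batch with the pairs shuffled, which is the sense in which such a latent space is "aligned".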

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation

Methods: Contrastive Language-Image Pre-training