Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

Shiyao Yu; Zi-An Wang; Kangning Yin; Zheng Tian; Mingyuan Zhang; Weixin Si; Shihao Zou

arXiv:2507.23188 · cs.CV · August 1, 2025

TL;DR

This paper introduces a multi-modal motion retrieval framework that aligns text, audio, video, and motion in a fine-grained joint embedding space, improving retrieval accuracy and user interaction.

Contribution

The framework is the first to incorporate audio into motion retrieval, and it uses sequence-level contrastive learning to align the modalities at a finer granularity.

Findings

01. 10.16% improvement in R@10 for text-to-motion retrieval
02. 25.43% improvement in R@1 for video-to-motion retrieval
03. Multi-modal framework outperforms 3-modal counterparts
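
For readers unfamiliar with the R@K metric quoted in these findings, the following is a minimal sketch of how recall at K is commonly computed for retrieval. It assumes a square similarity matrix in which query i has exactly one correct gallery item, at index i; the function name `recall_at_k` and the toy scores are illustrative and do not come from the paper, whose evaluation protocol may differ.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks within the top k.

    Assumption: similarity[i][j] scores query i against gallery item j,
    and the correct item for query i is gallery item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank = number of gallery items scoring strictly higher
        # than the ground-truth item.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores (made up for illustration): 3 queries against 3 gallery motions.
sim = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, 1)
r2 = recall_at_k(sim, 2)
print(f"R@1 = {r1:.3f}, R@2 = {r2:.3f}")
```

In this toy case the second query ranks its correct motion second, so it counts toward R@2 but not R@1.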

Abstract

Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better…
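
To make the idea of sequence-level contrastive learning more concrete, here is a minimal sketch under stated assumptions. The abstract does not specify how sequences are compared, so this example swaps in one common choice: a late-interaction score (each token in one sequence is matched to its most similar token in the other, and the matches are averaged) fed into an InfoNCE-style loss. The function names, the temperature value, and the toy features are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def sequence_similarity(seq_a, seq_b):
    """Assumed late-interaction score between two token sequences.

    For each token in seq_a, take its best cosine match in seq_b,
    then average over the tokens of seq_a.
    """
    a = [normalize(t) for t in seq_a]
    b = [normalize(t) for t in seq_b]
    return sum(max(dot(x, y) for y in b) for x in a) / len(a)


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy in which queries[i] should retrieve keys[i]."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [sequence_similarity(q, k) / temperature for k in keys]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(queries)


# Toy batch (made up): two text sequences and two motion sequences of 2-D tokens.
text = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
motion = [[[1.0, 0.1], [0.8, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.2, 1.0]]]

aligned = info_nce(text, motion)         # correct pairs share an index
shuffled = info_nce(text, motion[::-1])  # pairs deliberately mismatched
print(f"aligned loss = {aligned:.4f}, shuffled loss = {shuffled:.4f}")
```

Keeping per-token features instead of pooling each sequence into a single vector is what lets such a loss reward fine-grained matches. With four modalities, one plausible extension would be to sum this loss over modality pairs, though whether the paper does so is not stated here.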

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Human Motion and Animation