Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

TL;DR
This paper introduces LAVIMO, a three-modality learning framework that integrates text, video, and motion data to improve human motion retrieval, achieving state-of-the-art results on the HumanML3D and KIT-ML datasets.
Contribution
The work presents a novel three-modality learning framework with a specialized attention mechanism for enhanced alignment among text, video, and motion modalities.
Findings
Achieves state-of-the-art performance on the HumanML3D and KIT-ML datasets.
Effectively bridges the gap between text and motion through video integration.
Improves cross-modal retrieval accuracy across multiple modality pairs.
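To make the idea of retrieval across modality pairs concrete, here is a minimal sketch of nearest-neighbour lookup in a shared embedding space. It is not LAVIMO's implementation: the embeddings are hand-written toy vectors rather than encoder outputs, and the names `cosine`, `retrieve`, `text_emb`, `video_emb`, and `motion_emb` are hypothetical. It only assumes that all three modalities have already been mapped into the same vector space, so that any pair can be compared by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, gallery, k=1):
    """Return the k gallery keys whose embeddings are most similar to the query."""
    ranked = sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)
    return ranked[:k]


# Toy 3-D embeddings (hypothetical values, not produced by any real encoder).
text_emb = {
    "a person walks forward": [0.9, 0.1, 0.0],
    "a person jumps": [0.0, 0.2, 0.9],
}
video_emb = {
    "walk_clip": [0.8, 0.2, 0.1],
    "jump_clip": [0.1, 0.1, 0.95],
}
motion_emb = {
    "walk_motion": [0.85, 0.15, 0.05],
    "jump_motion": [0.05, 0.15, 0.9],
}

# Because every modality lives in the same space, one function serves every pair.
text_to_motion = retrieve(text_emb["a person jumps"], motion_emb)
video_to_motion = retrieve(video_emb["walk_clip"], motion_emb)
print(text_to_motion, video_to_motion)
```

The same `retrieve` call works for text-to-motion and video-to-motion queries alike, which is the practical benefit of a joint space: adding a modality adds new query types without adding a new matching procedure.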
Abstract
Information retrieval is an ever-evolving and crucial research domain. The substantial demand for high-quality human motion data, especially for online acquisition, has led to a surge in human motion research. Prior work has mainly concentrated on dual-modality learning, such as text-and-motion tasks, while three-modality learning has rarely been explored. Intuitively, introducing an extra modality can broaden a model's application scenarios and, more importantly, a well-chosen extra modality can act as an intermediary that enhances the alignment between the other two disparate modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion alignment), a novel framework for three-modality learning that integrates human-centric videos as an additional modality, thereby effectively bridging the gap between text and motion. Moreover, our approach leverages a specially…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Human Motion and Animation
