Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval

Nicola Messina; Jan Sedmidubsky; Fabrizio Falchi; Tomáš Rebok

arXiv:2407.02104 · cs.CV · July 3, 2024

TL;DR

This paper introduces a joint-dataset learning approach with a novel regularization technique and a transformer-based motion encoder to improve text-to-motion retrieval accuracy across multiple datasets.

Contribution

It proposes a joint-dataset training framework with a Cross-Consistent Contrastive Loss and a new motion encoder, enhancing cross-dataset generalization in text-to-motion retrieval.

Findings

01. Improved retrieval performance on KIT Motion-Language and HumanML3D datasets.

02. Effective regularization via CCCL enhances cross-modal representation.

03. Demonstrated benefits of joint-dataset learning through ablation studies.

Abstract

Pose-estimation methods enable extracting human motion from common videos in the structured form of 3D skeleton sequences. Despite great application opportunities, effective content-based access to such spatio-temporal motion data is a challenging problem. In this paper, we focus on the recently introduced text-motion retrieval tasks, which aim to search for database motions that are the most relevant to a specified natural-language textual description (text-to-motion) and vice-versa (motion-to-text). Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models effectively. To address this issue, we propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously - together with the introduction of a Cross-Consistent Contrastive Loss function (CCCL),…
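
To make the two retrieval directions described above concrete, here is a minimal sketch of ranking by cosine similarity in a shared embedding space. It is not the paper's method: the embeddings are hand-written, hypothetical values standing in for the outputs of trained text and motion encoders, and the helper names `cosine` and `rank` are chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank(query, database):
    """Return database indices sorted from most to least similar to the query."""
    scores = [cosine(query, item) for item in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings (assumption): in a real system these would be
# produced by a motion encoder and a text encoder trained into a common space.
motion_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
text_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.1, 0.1, 1.0]]

# Text-to-motion: a text query ranks the motions in the database.
text_to_motion = rank(text_embeddings[0], motion_embeddings)

# Motion-to-text: a motion query ranks the textual descriptions.
motion_to_text = rank(motion_embeddings[2], text_embeddings)

print(text_to_motion)
print(motion_to_text)
```

Both directions reuse the same ranking routine; only the roles of query and database swap. Retrieval quality then depends entirely on how well the two encoders align matching text and motion pairs.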

Taxonomy

Topics: Video Analysis and Summarization · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Focus