Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
Yao Zhang, Zhuchenyang Liu, Yanlan He, Thomas Ploetz, Yu Xiao

TL;DR
This paper introduces an interpretable approach to text-to-motion retrieval that combines joint-angle motion images with token-patch late interaction, improving both accuracy and interpretability over existing global-embedding methods.
Contribution
The paper proposes a joint-angle-based motion representation and a token-wise late interaction mechanism (MaxSim) with Masked Language Modeling regularization, enabling fine-grained, interpretable text-motion alignment.
Findings
Outperforms state-of-the-art retrieval methods on HumanML3D and KIT-ML datasets.
Provides interpretable, fine-grained correspondences between text and motion.
Enhances robustness with Masked Language Modeling regularization.
Abstract
Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable…
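The token-wise late interaction named in the abstract can be illustrated with a minimal sketch. It assumes the common ColBERT-style form of MaxSim: each text token is matched to its most similar motion patch by cosine similarity, and the per-token maxima are summed into one retrieval score. The function names, array shapes, and the use of cosine similarity are assumptions for illustration, not details confirmed by the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def maxsim_score(text_tokens, motion_patches):
    """ColBERT-style MaxSim between one text and one motion (assumed form).

    text_tokens:    (num_tokens, dim) array of token embeddings
    motion_patches: (num_patches, dim) array of patch embeddings
    Returns the summed score and, for each token, the index of the
    patch it matched best.
    """
    t = l2_normalize(np.asarray(text_tokens, dtype=float))
    p = l2_normalize(np.asarray(motion_patches, dtype=float))
    sim = t @ p.T                    # (num_tokens, num_patches) cosine similarities
    best_patch = sim.argmax(axis=1)  # most similar patch per token
    score = sim.max(axis=1).sum()    # sum of per-token maxima
    return float(score), best_patch


# Toy inputs: 2 text tokens and 3 motion patches in a 2-D embedding space.
tokens = np.array([[1.0, 0.0], [0.0, 1.0]])
patches = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
score, best_patch = maxsim_score(tokens, patches)
```

Under these assumptions, `best_patch` is the kind of token-to-patch correspondence that makes the retrieval result inspectable: it records which patch of the motion image each word aligned with, whereas a single global embedding has no such per-token mapping.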
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Multimodal Machine Learning Applications
