Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction

Yao Zhang; Zhuchenyang Liu; Yanlan He; Thomas Ploetz; Yu Xiao

arXiv:2603.09930 · cs.CV · March 11, 2026

Open Access

TL;DR

This paper introduces an interpretable approach to text-to-motion retrieval that combines joint-angle motion images with token-patch late interaction, improving both accuracy and interpretability over existing global-embedding methods.

Contribution

The paper proposes a joint-angle-based motion representation and a token-wise late interaction mechanism (MaxSim) with Masked Language Modeling regularization, enabling fine-grained, interpretable text-motion alignment.

Findings

01 Outperforms state-of-the-art retrieval methods on the HumanML3D and KIT-ML datasets.

02 Provides interpretable, fine-grained correspondences between text and motion.

03 Enhances robustness with Masked Language Modeling regularization.

Abstract

Text-motion retrieval aims to learn a semantically aligned latent space between natural language descriptions and 3D human motion skeleton sequences, enabling bidirectional search across the two modalities. Most existing methods use a dual-encoder framework that compresses motion and text into global embeddings, discarding fine-grained local correspondences, and thus reducing accuracy. Additionally, these global-embedding methods offer limited interpretability of the retrieval results. To overcome these limitations, we propose an interpretable, joint-angle-based motion representation that maps joint-level local features into a structured pseudo-image, compatible with pre-trained Vision Transformers. For text-to-motion retrieval, we employ MaxSim, a token-wise late interaction mechanism, and enhance it with Masked Language Modeling regularization to foster robust, interpretable…
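
To make the late-interaction idea concrete, the following is a minimal sketch of MaxSim scoring in its commonly used form: each text token is matched to its most similar motion patch, and the per-token maxima are summed. The abstract does not give the paper's exact formulation, so the cosine similarity, the summation, the function names (`maxsim_score`, `l2_normalize`, `dot`) and the toy 2-D embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(token_embs, patch_embs):
    """Generic MaxSim: sum over text tokens of the max cosine similarity
    to any motion patch. Also returns the index of the best-matching
    patch per token, which is what makes the score inspectable.
    (Assumed formulation; the paper's details may differ.)"""
    tokens = [l2_normalize(t) for t in token_embs]
    patches = [l2_normalize(p) for p in patch_embs]
    score = 0.0
    best_patches = []
    for t in tokens:
        sims = [dot(t, p) for p in patches]
        best = max(range(len(sims)), key=sims.__getitem__)
        best_patches.append(best)
        score += sims[best]
    return score, best_patches


# Hypothetical toy embeddings: two text tokens, two candidate motions.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
motion_a = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
motion_b = [[0.7, 0.7], [0.6, 0.8]]

score_a, match_a = maxsim_score(text_tokens, motion_a)
score_b, match_b = maxsim_score(text_tokens, motion_b)
print(score_a > score_b, match_a)
```

In this toy case `motion_a` ranks higher because it contains one patch closely aligned with each token, whereas a single pooled embedding would blur that distinction. The per-token `best_patches` indices show which patch each word matched, which is the kind of token-to-patch correspondence a late-interaction score exposes.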

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · Multimodal Machine Learning Applications