Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning

Hanmo Chen; Guangtao Lyu; Chenghao Xu; Jiexi Yan; Xu Yang; Cheng Deng

arXiv:2601.21904 · cs.CV · February 5, 2026

Open Access

TL;DR

This paper introduces a pyramidal learning framework for fine-grained motion-language retrieval, capturing hierarchical local interactions between motion segments, joints, and text tokens to improve alignment accuracy.

Contribution

The paper proposes a Pyramidal Shapley-Taylor (PST) framework that decomposes motion into temporal segments and body joints for hierarchical cross-modal alignment, moving beyond methods that align only whole motion sequences with global text representations.

Findings

01 Outperforms state-of-the-art methods on benchmark datasets
02 Achieves precise local motion-text alignment
03 Effectively captures hierarchical motion semantics

Abstract

As a foundational task in human-centric cross-modal intelligence, motion-language retrieval aims to bridge the semantic gap between natural language and human motion, enabling intuitive motion analysis. Yet existing approaches predominantly focus on aligning entire motion sequences with global textual representations. This global-centric paradigm overlooks fine-grained interactions among local motion segments, individual body joints, and text tokens, inevitably leading to suboptimal retrieval performance. To address this limitation, we draw inspiration from the pyramidal process of human motion perception (from joint dynamics to segment coherence, and finally to holistic comprehension) and propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval. Specifically, the framework decomposes human motion into temporal segments and spatial…
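
The abstract is cut off before it describes how PST is computed, so the following is only a minimal sketch of the generic order-2 Shapley-Taylor interaction index that the framework's name refers to, not the authors' formulation. It may help to see what an "interaction" between a motion element and a text token means numerically. The player names (`joint_arm`, `seg_0`, `word_wave`) and the `toy_value` scoring function are hypothetical stand-ins for whatever similarity score a real retrieval model would produce on a subset of tokens.

```python
from itertools import combinations
from math import comb


def discrete_derivative(v, S, T):
    """delta_S v(T): alternating sum of v(T union W) over all subsets W of S."""
    total = 0.0
    for r in range(len(S) + 1):
        for W in combinations(S, r):
            sign = (-1) ** (len(S) - r)
            total += sign * v(frozenset(T) | frozenset(W))
    return total


def shapley_taylor_order2(v, players):
    """Order-2 Shapley-Taylor indices for singletons and pairs of players."""
    n = len(players)
    index = {}
    # Sets smaller than the order (singletons here) use the derivative at the empty set.
    for i in players:
        index[(i,)] = discrete_derivative(v, (i,), ())
    # Pairs average the second-order derivative over all contexts T that exclude the pair.
    for i, j in combinations(players, 2):
        rest = [p for p in players if p not in (i, j)]
        acc = 0.0
        for r in range(len(rest) + 1):
            for T in combinations(rest, r):
                acc += discrete_derivative(v, (i, j), T) / comb(n - 1, r)
        index[(i, j)] = 2.0 / n * acc
    return index


# Hypothetical players: one joint token, one segment token, one text token.
PLAYERS = ("joint_arm", "seg_0", "word_wave")


def toy_value(subset):
    """Assumed toy score: each token adds 0.1; the arm joint and 'wave' add 0.5 jointly."""
    bonus = 0.5 if {"joint_arm", "word_wave"} <= subset else 0.0
    return 0.1 * len(subset) + bonus


scores = shapley_taylor_order2(toy_value, PLAYERS)
for key, value in scores.items():
    print(key, round(value, 3))
```

In this toy setup the only nonzero pairwise term is the one between `joint_arm` and `word_wave`, and the singleton and pairwise terms together sum to the score of the full token set. The exact enumeration over subsets is exponential in the number of tokens, so any practical system would need sampling or a learned approximation.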

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Human Pose and Action Recognition