Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training
Yunjiao Zhou, Xinyan Chen, Junlang Qian, Lihua Xie, Jianfei Yang

TL;DR
This paper introduces ZOMG, a zero-shot, open-vocabulary framework for segmenting human motion sequences into meaningful sub-actions without annotations, leveraging language models and soft masking for effective motion understanding.
Contribution
ZOMG is the first framework to perform open-vocabulary, zero-shot motion grounding without annotations, combining language-based decomposition and adaptive masking techniques.
Findings
Achieves a +8.7% mAP improvement on the HumanML3D benchmark.
Outperforms prior methods in motion grounding accuracy.
Enables annotation-free motion understanding for downstream tasks.
Abstract
Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and…
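The abstract does not say how the temporal masks are parameterized, so the following is only an illustrative sketch of what a soft temporal mask over ordered sub-actions can look like, not ZOMG's actual formulation. It assumes each sub-action occupies one contiguous span separated by cut points, and builds smooth per-frame weights from sigmoid-shaped steps. The names `soft_segment_masks`, `hard_labels`, `boundaries`, and `sharpness` are hypothetical and chosen for this example.

```python
import numpy as np

def soft_segment_masks(boundaries, num_frames, sharpness=1.0):
    """Build (K, T) soft masks for K ordered segments over T frames.

    boundaries: K-1 increasing cut points, in frame units (assumed input).
    sharpness:  larger values make the masks closer to hard 0/1 windows.
    """
    t = np.arange(num_frames, dtype=float)
    cuts = np.asarray(boundaries, dtype=float)
    # Smooth step per cut: close to 0 before the cut, close to 1 after it.
    # tanh form of the sigmoid avoids overflow for large arguments.
    steps = 0.5 * (1.0 + np.tanh(0.5 * sharpness * (t[None, :] - cuts[:, None])))
    ones = np.ones((1, num_frames))
    zeros = np.zeros((1, num_frames))
    starts = np.vstack([ones, steps])   # "segment k has started" weights
    ends = np.vstack([steps, zeros])    # "segment k has ended" weights
    # Each mask is (started) - (ended); masks sum to 1 at every frame.
    return starts - ends

def hard_labels(masks):
    """Assign each frame to the segment with the largest mask weight."""
    return masks.argmax(axis=0)

# Three ordered sub-actions over 30 frames, with cuts at frames 10 and 20.
masks = soft_segment_masks([10, 20], num_frames=30, sharpness=2.0)
labels = hard_labels(masks)
```

Because the cut points are continuous values, a parameterization like this could in principle be tuned per motion instance by gradient-based optimization, and the single-span construction keeps each segment contiguous. Whether ZOMG uses anything similar is not stated in the text above.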
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications
