Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training

Yunjiao Zhou; Xinyan Chen; Junlang Qian; Lihua Xie; Jianfei Yang

arXiv:2511.15379 · cs.CV · November 20, 2025

TL;DR

This paper introduces ZOMG, a zero-shot, open-vocabulary framework for segmenting human motion sequences into meaningful sub-actions without annotations, leveraging language models and soft masking for effective motion understanding.

Contribution

ZOMG is the first framework to perform open-vocabulary, zero-shot motion grounding without annotations, combining language-based decomposition and adaptive masking techniques.

Findings

01. Achieves a +8.7% mAP improvement on the HumanML3D benchmark.

02. Outperforms prior methods in motion grounding accuracy.

03. Enables annotation-free motion understanding for downstream tasks.

Abstract

Understanding complex human activities demands the ability to decompose motion into fine-grained, semantically aligned sub-actions. This motion grounding process is crucial for behavior analysis, embodied AI, and virtual reality. Yet most existing methods rely on dense supervision with predefined action classes, which is infeasible in open-vocabulary, real-world settings. In this paper, we propose ZOMG, a zero-shot, open-vocabulary framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. Technically, ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization, which learns instance-specific temporal masks to focus on frames critical to sub-actions, while maintaining intra-segment continuity and…
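
To make the grounding output concrete, here is a minimal sketch of assigning ordered sub-actions to contiguous frame ranges. It is not ZOMG's method: it swaps the learned soft masks for an exact dynamic program over hard boundaries, and it assumes a hypothetical per-frame similarity matrix `sim` (frames by sub-actions) that some motion-text encoder would have to supply. The function name `ground_ordered_segments` is likewise made up for illustration.

```python
import numpy as np


def ground_ordered_segments(sim):
    """Split T frames into K ordered, contiguous, non-empty segments.

    sim[t, k] is a hypothetical similarity between frame t and sub-action k
    (an assumption of this sketch, not something the paper specifies).
    Returns one (start, end) pair per sub-action, with end exclusive.
    """
    sim = np.asarray(sim, dtype=float)
    T, K = sim.shape
    if K > T:
        raise ValueError("need at least one frame per sub-action")

    # prefix[t, k] = sum of sim[:t, k], so a segment score is a difference.
    prefix = np.vstack([np.zeros((1, K)), np.cumsum(sim, axis=0)])

    # best[k, t]: best total score when the first k sub-actions cover frames [0, t).
    best = np.full((K + 1, T + 1), -np.inf)
    back = np.zeros((K + 1, T + 1), dtype=int)
    best[0, 0] = 0.0

    for k in range(1, K + 1):
        for t in range(k, T + 1):
            for s in range(k - 1, t):  # s is where sub-action k-1 starts
                score = best[k - 1, s] + prefix[t, k - 1] - prefix[s, k - 1]
                if score > best[k, t]:
                    best[k, t] = score
                    back[k, t] = s

    # Walk the back-pointers from the last frame to recover the boundaries.
    segments, t = [], T
    for k in range(K, 0, -1):
        s = back[k, t]
        segments.append((int(s), int(t)))
        t = s
    return segments[::-1]


if __name__ == "__main__":
    # Toy input: frames 0-2 match sub-action 0, frames 3-5 match sub-action 1.
    toy = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
    print(ground_ordered_segments(toy))
```

The hard boundaries enforce the ordering and continuity that the abstract mentions, but only as a fixed constraint. A soft-mask formulation would instead optimize per-frame weights, which this sketch does not attempt.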

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications