DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition

Jingmin Zhu, Anqi Zhu, James Bailey, Jun Liu, Hossein Rahmani, Mohammed Bennamoun, Farid Boussaid, and Qiuhong Ke

arXiv:2512.11941 · cs.CV · December 16, 2025

TL;DR

DynaPURLS introduces a dynamic, multi-scale approach for skeleton-based zero-shot action recognition, utilizing hierarchical textual descriptions and adaptive visual-semantic alignment to improve generalization to unseen classes.

Contribution

The paper proposes DynaPURLS, a novel framework that dynamically refines visual-semantic correspondences at inference time using large language models and adaptive partitioning, addressing domain shift in zero-shot recognition.

Findings

01. Achieves state-of-the-art results on the NTU RGB+D 60/120 and PKU-MMD datasets.

02. Mitigates domain shift through dynamic refinement and a confidence-aware memory bank.

03. Improves zero-shot recognition accuracy over prior methods.
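
The summary names a confidence-aware memory bank but does not describe how it works. The following is a minimal, generic sketch of one way such a bank could behave, not the paper's implementation: the class `ConfidenceMemoryBank`, its `threshold`, `capacity`, and `mix` parameters, and the prototype-blending rule are all assumptions made for illustration. It keeps only test-time features whose prediction confidence passes a threshold and blends their mean into a class's text prototype.

```python
# Hypothetical sketch: every name and rule here is an assumption,
# not taken from the DynaPURLS paper.

class ConfidenceMemoryBank:
    """Stores high-confidence test features per predicted class."""

    def __init__(self, threshold=0.8, capacity=4):
        self.threshold = threshold  # minimum confidence to admit a sample
        self.capacity = capacity    # maximum samples kept per class
        self.slots = {}             # class label -> list of (confidence, feature)

    def update(self, label, confidence, feature):
        """Admit a sample if it passes the confidence gate.

        Returns True when the sample passed the threshold, False otherwise.
        """
        if confidence < self.threshold:
            return False
        bucket = self.slots.setdefault(label, [])
        bucket.append((confidence, list(feature)))
        # Keep only the most confident samples, up to `capacity`.
        bucket.sort(key=lambda item: item[0], reverse=True)
        del bucket[self.capacity:]
        return True

    def refined_prototype(self, label, text_prototype, mix=0.5):
        """Blend the text prototype with the mean of the stored features."""
        bucket = self.slots.get(label)
        if not bucket:
            # Nothing stored yet: fall back to the unmodified text prototype.
            return list(text_prototype)
        dim = len(text_prototype)
        mean = [sum(f[i] for _, f in bucket) / len(bucket) for i in range(dim)]
        return [(1 - mix) * t + mix * m for t, m in zip(text_prototype, mean)]
```

In this sketch, low-confidence predictions never influence the prototypes, which is one common way to limit error accumulation when adapting on unlabeled test data.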

Abstract

Zero-shot skeleton-based action recognition (ZS-SAR) is fundamentally constrained by prevailing approaches that rely on aligning skeleton features with static, class-level semantics. This coarse-grained alignment fails to bridge the domain shift between seen and unseen classes, thereby impeding the effective transfer of fine-grained visual knowledge. To address these limitations, we introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences and dynamically refines them at inference time to enhance generalization. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Concurrently, an adaptive partitioning module produces fine-grained visual representations by semantically grouping skeleton joints. To fortify this…
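
To make the idea of multi-scale visual-semantic alignment concrete, here is a small sketch of zero-shot scoring that combines a global similarity with per-body-part similarities. It is an illustration under stated assumptions, not the authors' method: the function names, the `global_weight` parameter, the equal-weight averaging over parts, and the toy two-dimensional vectors are all hypothetical. In practice the vectors would come from a skeleton encoder and a text encoder.

```python
import math

# Hypothetical sketch: names, weights, and data are assumptions for illustration.

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def multi_scale_scores(visual, class_text, global_weight=0.5):
    """Score each class by mixing global and part-level similarities.

    `visual` and each entry of `class_text` have the form
    {"global": vector, "parts": {part_name: vector}}.
    """
    scores = {}
    for name, text in class_text.items():
        global_sim = cosine(visual["global"], text["global"])
        shared = [p for p in visual["parts"] if p in text["parts"]]
        if shared:
            local_sim = sum(
                cosine(visual["parts"][p], text["parts"][p]) for p in shared
            ) / len(shared)
        else:
            local_sim = 0.0
        scores[name] = global_weight * global_sim + (1 - global_weight) * local_sim
    return scores

def predict(visual, class_text, global_weight=0.5):
    """Return the unseen class with the highest combined score."""
    scores = multi_scale_scores(visual, class_text, global_weight)
    return max(scores, key=scores.get)

# Toy data: one skeleton sample and two unseen classes.
sample = {"global": [1.0, 0.0], "parts": {"arm": [1.0, 0.0], "leg": [0.0, 1.0]}}
classes = {
    "wave": {"global": [1.0, 0.0], "parts": {"arm": [1.0, 0.0], "leg": [0.0, 1.0]}},
    "kick": {"global": [0.0, 1.0], "parts": {"arm": [0.0, 1.0], "leg": [1.0, 0.0]}},
}
```

With this toy data, `predict(sample, classes)` selects `"wave"`, since the sample matches that class at both the global and the part level.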

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning