Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Jidong Kuang; Hongsong Wang; Chaolei Han; Yang Zhang; Jie Gui

arXiv:2409.14336 · cs.CV · August 25, 2025

Open Access

TL;DR

This paper introduces a dual alignment approach with semantic enhancement for skeleton-based zero-shot action recognition, significantly improving the alignment between visual and textual features to recognize unseen actions.

Contribution

The paper proposes the Dual Visual-Text Alignment (DVTA) framework, which combines a direct alignment module, an augmented alignment module, and semantic description enhancement to improve zero-shot action recognition accuracy.

Findings

01. Achieves state-of-the-art results on multiple benchmarks.

02. Reduces the modality gap between visual and text features through its dual alignment modules.

03. Strengthens the semantic connection between the two modalities with cross-attention-based description enhancement.

Abstract

Zero-shot action recognition, which addresses the scalability and generalization of action recognition and allows models to adapt dynamically to new and unseen actions, is an important research topic in the computer vision community. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of the text categories or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning a robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. The DVTA consists of two alignment modules--Direct…
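To make the "direct projection" baseline mentioned in the abstract concrete, the following is a minimal sketch of zero-shot prediction by cosine similarity between projected visual features and class text embeddings. It is not the DVTA method. The function names, feature dimensions, and random data are all hypothetical, and the projection matrix is assumed to have been learned elsewhere on seen classes.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def zero_shot_predict(visual_feats, projection, class_text_embs):
    """Assign each sample to the unseen class whose text embedding is closest.

    visual_feats:    (N, d_vis)  skeleton features from some encoder (assumed given)
    projection:      (d_vis, d_text) linear map into the text space (assumed learned)
    class_text_embs: (C, d_text) one embedding per unseen class name or description
    """
    projected = visual_feats @ projection                             # (N, d_text)
    sims = l2_normalize(projected) @ l2_normalize(class_text_embs).T  # (N, C)
    return sims.argmax(axis=1), sims

# Toy data (hypothetical): 3 unseen classes, 16-dim visual and 8-dim text features.
rng = np.random.default_rng(0)
class_text_embs = rng.normal(size=(3, 8))
projection = rng.normal(size=(16, 8))

# Construct visual features that project exactly onto each class embedding,
# so sample i should be assigned to class i.
visual_feats = class_text_embs @ np.linalg.pinv(projection)           # (3, 16)

pred, sims = zero_shot_predict(visual_feats, projection, class_text_embs)
print(pred)
```

In this toy setup the features are built to be perfectly aligned, so the nearest-class rule recovers every label. With real skeleton features the projection is only approximate, which is the alignment problem the abstract describes.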

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications

Methods: ALIGN