Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment
Jidong Kuang, Hongsong Wang, Chaolei Han, Yang Zhang, Jie Gui

TL;DR
This paper introduces a dual alignment approach with semantic enhancement for skeleton-based zero-shot action recognition, significantly improving the alignment between visual and textual features to recognize unseen actions.
Contribution
It proposes a novel Dual Visual-Text Alignment (DVTA) framework with direct and augmented modules, plus semantic description enhancement, to improve zero-shot action recognition accuracy.
Findings
Achieves state-of-the-art results on multiple benchmarks.
Reduces the modality gap through dual alignment modules.
Enhances the semantic connection with cross-attention-based description enhancement.
Abstract
Zero-shot action recognition, which addresses the issue of scalability and generalization in action recognition and allows models to adapt to new and unseen actions dynamically, is an important research topic in the computer vision community. The key to zero-shot action recognition lies in aligning visual features with semantic vectors representing action categories. Most existing methods either directly project visual features onto the semantic space of the text categories or learn a shared embedding space between the two modalities. However, a direct projection cannot accurately align the two modalities, and learning a robust and discriminative embedding space between visual and text representations is often difficult. To address these issues, we introduce Dual Visual-Text Alignment (DVTA) for skeleton-based zero-shot action recognition. The DVTA consists of two alignment modules--Direct…
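The abstract's baseline idea, projecting a visual feature into the text semantic space and matching it against category vectors, can be illustrated with a minimal sketch. This is not the DVTA method: the linear projection, the function names (`project`, `zero_shot_predict`), and the toy vectors are all assumptions made for illustration. In practice the projection would be learned on seen classes and the text vectors would come from a language model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def project(features, weights):
    """Hypothetical linear projection; each row of `weights` yields one output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def zero_shot_predict(skeleton_feature, weights, class_text_embeddings):
    """Pick the unseen class whose text vector is closest (by cosine) to the projected feature."""
    projected = l2_normalize(project(skeleton_feature, weights))
    scores = {}
    for label, text_vec in class_text_embeddings.items():
        text_unit = l2_normalize(text_vec)
        scores[label] = sum(p * t for p, t in zip(projected, text_unit))
    best = max(scores, key=scores.get)
    return best, scores


# Toy data (assumed): a 3-d skeleton feature, a 3-d -> 2-d projection,
# and 2-d text vectors for two classes never seen during training.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unseen_classes = {"wave": [1.0, 0.1], "kick": [0.0, 1.0]}

label, scores = zero_shot_predict([0.9, 0.2, 0.5], weights, unseen_classes)
print(label)
```

Because classification reduces to a nearest-neighbor lookup over text vectors, adding a new action category only requires a new text embedding, which is what makes the setting zero-shot. It also shows why alignment quality matters: any mismatch between the projected feature and the text space directly changes the ranking.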
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Multimodal Machine Learning Applications
Methods: ALIGN
