TL;DR
This paper introduces PGFA, a novel end-to-end framework for zero-shot skeleton-based action recognition that enhances skeleton-text alignment and mitigates distribution bias, significantly outperforming previous methods.
Contribution
The paper proposes a prototype-guided feature alignment approach with an end-to-end contrastive training framework for improved zero-shot action recognition.
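For readers unfamiliar with contrastive alignment, the sketch below shows one common form of it: a symmetric, CLIP-style loss that pulls each skeleton feature toward its paired text feature and pushes it away from the other texts in the batch. This is a generic illustration, not the paper's loss. The function names, the temperature value, and the assumption that each batch item belongs to a different class are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def log_softmax_at(logits, index):
    """Log-softmax of `logits`, evaluated at position `index` (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[index] - log_sum


def symmetric_contrastive_loss(skeleton_feats, text_feats, temperature=0.1):
    """Generic symmetric contrastive loss (an assumption, not the paper's exact loss).

    skeleton_feats[i] and text_feats[i] are treated as the matching pair;
    every other combination in the batch is treated as a negative.
    """
    n = len(skeleton_feats)
    # Temperature-scaled similarity matrix: rows are skeletons, columns are texts.
    sim = [[cosine(s, t) / temperature for t in text_feats] for s in skeleton_feats]
    # Skeleton-to-text direction: each row should peak on its diagonal entry.
    s2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-skeleton direction: each column should peak on its diagonal entry.
    t2s = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (s2t + t2s)
```

In an end-to-end setup, the gradient of a loss like this would flow back into the skeleton encoder, whereas a two-stage pipeline keeps the encoder fixed once pre-training ends.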
Findings
Achieves up to a 22.96% accuracy improvement on the NTU-60 dataset.
Demonstrates superior performance over existing methods on three datasets.
Provides theoretical analysis supporting the proposed alignment strategy.
Abstract
Zero-shot skeleton-based action recognition aims to classify unseen skeleton-based human actions without prior exposure to such categories during training. This task is extremely challenging due to the difficulty in generalizing from known to unknown actions. Previous studies typically use two-stage training: pre-training skeleton encoders on seen action categories using cross-entropy loss and then aligning pre-extracted skeleton and text features, enabling knowledge transfer to unseen classes through skeleton-text alignment and language models' generalization. However, their efficacy is hindered by 1) insufficient discrimination for skeleton features, as the fixed skeleton encoder fails to capture necessary alignment information for effective skeleton-text alignment; 2) the neglect of alignment bias between skeleton and unseen text features during testing. To this end, we propose a…
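The skeleton-text alignment idea described in the abstract can be illustrated with a minimal inference sketch: once skeleton and text features share an embedding space, an unseen action is predicted by picking the class whose text feature is most similar to the skeleton feature. The code below assumes cosine similarity as the matching score; the function name `predict_unseen`, the class names, and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_unseen(skeleton_feat, text_feats_by_class):
    """Return the unseen class whose text feature best matches the skeleton feature.

    text_feats_by_class maps a class name to its text embedding. Both kinds
    of embedding are assumed to already live in the same aligned space.
    """
    return max(
        text_feats_by_class,
        key=lambda name: cosine(skeleton_feat, text_feats_by_class[name]),
    )


# Toy, hand-written vectors standing in for encoder outputs (illustrative only).
unseen_text_feats = {
    "jump": [0.9, 0.1, 0.0],
    "wave": [0.0, 0.2, 0.9],
}
prediction = predict_unseen([0.8, 0.3, 0.1], unseen_text_feats)
```

Because no skeleton samples of the unseen classes are available during training, the quality of this nearest-text lookup depends entirely on how well the two feature spaces were aligned on the seen classes.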
