TL;DR
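SynSE is a syntactically guided generative model for zero-shot skeleton-based action recognition. It imposes inter-modal constraints between the embedding of an action sequence and the embeddings of Part-of-Speech (PoS) tagged words in the action description, and achieves state-of-the-art results on the large-scale NTU-60 and NTU-120 datasets.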
Contribution
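The paper introduces SynSE, a zero-shot skeleton action recognition method that uses syntactic guidance (PoS-tagged action descriptions) and inter-modal constraints to improve generalization to unseen actions. The authors state that they are the first to present zero-shot skeleton action recognition results on large-scale data.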
Findings
SynSE outperforms strong baselines on NTU-60 and NTU-120 datasets.
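The approach generalizes compositionally: it recognizes sequences whose action descriptions contain words not encountered during training.
SynSE achieves state-of-the-art results in both the Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) settings; GZSL is handled by a confidence-based gating mechanism.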
Abstract
We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between action sequence embedding and embeddings of Parts of Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60…
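The abstract mentions a confidence-based gating mechanism for GZSL but does not describe it here. As a rough illustration of the general idea only, the sketch below routes each sample to a seen-class classifier when that classifier is confident, and to a zero-shot classifier otherwise. The function name `gated_predict`, the fixed `threshold`, and the use of maximum softmax probability as the confidence score are all assumptions made for this example; they are not taken from the paper, whose gate may be designed differently.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def gated_predict(seen_logits, unseen_logits, seen_labels, unseen_labels,
                  threshold=0.7):
    """Hypothetical confidence gate for GZSL (illustrative, not the paper's).

    seen_logits:   (N, S) scores from a classifier trained on seen classes.
    unseen_logits: (N, U) scores from a zero-shot classifier on unseen classes.
    threshold:     assumed cut-off on the max seen-class probability.
    """
    seen_probs = softmax(np.asarray(seen_logits, dtype=float))
    unseen_probs = softmax(np.asarray(unseen_logits, dtype=float))

    # Confidence is the highest seen-class probability for each sample.
    confidence = seen_probs.max(axis=1)
    use_seen = confidence >= threshold

    seen_pred = np.asarray(seen_labels)[seen_probs.argmax(axis=1)]
    unseen_pred = np.asarray(unseen_labels)[unseen_probs.argmax(axis=1)]

    # Confident samples keep the seen-class label; the rest fall back
    # to the zero-shot prediction.
    return np.where(use_seen, seen_pred, unseen_pred), use_seen


if __name__ == "__main__":
    preds, use_seen = gated_predict(
        seen_logits=[[5.0, 0.0, 0.0], [0.1, 0.0, 0.1]],
        unseen_logits=[[0.0, 1.0], [2.0, 0.0]],
        seen_labels=["walk", "sit", "jump"],
        unseen_labels=["wave", "kick"],
    )
    print(preds, use_seen)
```

In this toy input the first sample has a sharply peaked seen-class distribution and is labeled by the seen classifier, while the second is nearly uniform over seen classes and is handed to the zero-shot branch. In practice the threshold (or a learned gate) would be tuned on held-out data.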
