TL;DR
This paper proposes a zero-shot action recognition (ZSAR) method that exploits semantic relationships between actions, objects, and descriptive sentences. It combines two global classifiers built on sentence embeddings and reports state-of-the-art results on Kinetics-400 and competitive results on UCF-101.
Contribution
It introduces a new ZSAR approach that uses descriptive sentences in two ways: to estimate object-action affinity, and to estimate probabilities over action classes without hard human labeling.
Findings
Achieves state-of-the-art results on the Kinetics-400 dataset.
Achieves competitive performance on the UCF-101 dataset.
Demonstrates effective use of sentence-based semantic representations.
Abstract
The success of Zero-shot Action Recognition (ZSAR) methods is intrinsically related to the nature of semantic side information used to transfer knowledge, although this aspect has not been primarily investigated in the literature. This work introduces a new ZSAR method based on the relationships of actions-objects and actions-descriptive sentences. We demonstrate that representing all object classes using descriptive sentences generates an accurate object-action affinity estimation when a paraphrase estimation method is used as an embedder. We also show how to estimate probabilities over the set of action classes based only on a set of sentences without hard human labeling. In our method, the probabilities from these two global classifiers (i.e., which use features computed over the entire video) are combined, producing an efficient transfer knowledge model for action classification.…
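The two-classifier idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the toy 2-D vectors stand in for sentence embeddings from a paraphrase model, the function names (`object_branch`, `sentence_branch`, `combine`) are hypothetical, and the fusion rule (a normalized product of the two distributions) is just one plausible choice, since the abstract does not state how the probabilities are combined.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def object_branch(object_probs, object_embs, action_embs):
    """Score each action by the affinity-weighted object probabilities.

    Affinity is assumed to be the cosine similarity between the embedding
    of an object's descriptive sentence and that of an action's sentence.
    """
    scores = [
        sum(p * cosine(o, a) for p, o in zip(object_probs, object_embs))
        for a in action_embs
    ]
    return softmax(scores)

def sentence_branch(video_sentence_embs, action_embs):
    """Score each action by its best match among sentences describing the video."""
    scores = [max(cosine(s, a) for s in video_sentence_embs) for a in action_embs]
    return softmax(scores)

def combine(p_obj, p_sent):
    """Assumed fusion rule: element-wise product, renormalized to sum to 1."""
    joint = [a * b for a, b in zip(p_obj, p_sent)]
    total = sum(joint)
    return [j / total for j in joint]

# Toy data (hypothetical): two actions, two objects, one video sentence.
actions = ["playing guitar", "riding horse"]
action_embs = [[1.0, 0.1], [0.1, 1.0]]
object_embs = [[0.9, 0.2], [0.2, 0.9]]   # "guitar", "horse"
object_probs = [0.8, 0.2]                # from an object classifier run on the video
video_sentence_embs = [[0.8, 0.3]]       # embedding of a sentence describing the video

probs = combine(
    object_branch(object_probs, object_embs, action_embs),
    sentence_branch(video_sentence_embs, action_embs),
)
predicted = actions[probs.index(max(probs))]
```

With this toy input, both branches favor the first action, so the combined distribution does as well. In a real pipeline the embeddings would come from a sentence encoder and the object probabilities from a pretrained classifier; both are outside the scope of this sketch.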
