TL;DR
This paper introduces CEZSAR, a contrastive learning approach for zero-shot action recognition that aligns videos and text descriptions in a joint embedding space, achieving state-of-the-art results on UCF-101 and Kinetics-400.
Contribution
It proposes a novel joint embedding model with automatic negative sampling to improve zero-shot action recognition performance.
Findings
Achieves state-of-the-art results on UCF-101 and Kinetics-400 datasets.
Introduces an automatic negative sampling procedure for training.
Targets both the semantic gap and the domain shift in ZSAR by learning a joint embedding space.
Abstract
This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (i.e., CNNs, RNNs, transformer-based models). This multimodal nature implies that the semantic properties of the two spaces are not identical. On the other hand, the domain shift arises from differences between the training and test sets and is inherent to ZSAR, since the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space,…
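The joint-embedding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the loss shown is a generic video-to-text InfoNCE with in-batch negatives (an assumption, standing in for the paper's automatic negative sampling, which the text here does not detail), and the function names, toy vectors, and class labels are all hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

def info_nce(video_embs, text_embs, temperature=0.1):
    # Assumed loss: video i is paired with text i; every other text in the
    # batch acts as a negative (in-batch negatives, not the paper's sampler).
    total = 0.0
    for i, v in enumerate(video_embs):
        logits = [cosine(v, t) / temperature for t in text_embs]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denom - logits[i]
    return total / len(video_embs)

def zero_shot_predict(video_emb, class_text_embs):
    # Pick the unseen class whose text embedding is closest to the video.
    return max(class_text_embs, key=lambda name: cosine(video_emb, class_text_embs[name]))

# Toy 2-D embeddings (hypothetical values, for illustration only).
videos = [[1.0, 0.1], [0.0, 1.0]]
texts_aligned = [[0.9, 0.0], [0.1, 1.0]]
texts_swapped = texts_aligned[::-1]

loss_aligned = info_nce(videos, texts_aligned)
loss_swapped = info_nce(videos, texts_swapped)

unseen_classes = {"archery": [1.0, 0.0], "surfing": [0.0, 1.0]}
pred = zero_shot_predict([0.8, 0.2], unseen_classes)
```

In this sketch the loss is lower when matching video and text embeddings point in the same direction, and classification of an unseen class reduces to a nearest-neighbor lookup over label embeddings, which is why no training examples of that class are required.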
