CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition

Valter Estevam; Rayson Laroca; Helio Pedrini; David Menotti

arXiv:2605.01165·cs.CV·May 5, 2026

TL;DR

This paper introduces CEZSAR, a contrastive learning approach for zero-shot action recognition that aligns videos and text descriptions in a joint embedding space, reporting state-of-the-art results on UCF-101 and Kinetics-400.

Contribution

The paper proposes a joint embedding model with an automatic negative sampling procedure to improve zero-shot action recognition performance.

Findings

01 Achieves state-of-the-art results on the UCF-101 and Kinetics-400 datasets.

02 Introduces an automatic negative sampling procedure for training.

03 Addresses the semantic gap and the domain shift in ZSAR.

Abstract

This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. The semantic gap occurs because label representations come from the textual domain (i.e., language models) and must be associated with visual representations (e.g., CNNs, RNNs, transformer-based models). This multimodal nature implies that the semantic properties of the two spaces are not identical. The domain shift, on the other hand, arises from differences between the training and test sets and is inherent to ZSAR, since the test set is unknown. One of the most promising methods to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space,…
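To make the idea of a joint embedding space concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss, together with zero-shot prediction by nearest class description. It is not taken from the paper or its repository: the function names, the `temperature` value, and the use of in-batch negatives are all assumptions made for illustration, and the paper's own automatic negative sampling procedure is not reproduced here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic InfoNCE-style loss (illustrative, not the paper's exact loss).

    Row i of video_emb and row i of text_emb are assumed to be a matching
    pair; every other row in the batch serves as a negative.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(text_emb)
    logits = v @ t.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # video -> text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)


def zero_shot_predict(video_emb, class_text_emb):
    """Assign each video to the unseen class whose text embedding is closest."""
    sims = l2_normalize(video_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


# Toy usage: random vectors stand in for encoder outputs (hypothetical data).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
video = text + 0.01 * rng.normal(size=(4, 8))
loss = symmetric_contrastive_loss(video, text)
predictions = zero_shot_predict(video, text)
```

In this sketch, unseen classes only need a text embedding at test time, which is what allows classification without any training videos for those classes.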

Code & Models

Repositories

valterlej/cezsar (GitHub)
