Cross-modal Representation Learning for Zero-shot Action Recognition

Chung-Ching Lin; Kevin Lin; Linjie Li; Lijuan Wang; Zicheng Liu

arXiv:2205.01657 · cs.CV · May 4, 2022

TL;DR

This paper introduces a cross-modal Transformer framework that jointly encodes video and text for zero-shot action recognition, improving accuracy by learning shared visual-semantic representations without extra pre-training.

Contribution

It proposes a novel end-to-end pipeline for learning discriminative visual and semantic features in a shared space, enhancing zero-shot recognition performance.

Findings

01. Significant improvement over state-of-the-art in ZSAR accuracy.

02. Effective semantic transfer scheme for unseen classes.

03. No additional dataset pre-training required.

Abstract

We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be discriminative and more semantically consistent. In zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to composite unseen visual prototypes. Accordingly, the discriminative features in the visual structure could be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic…
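
The semantic transfer idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how relatedness is computed or how prototypes are combined, so the cosine similarity, the softmax weighting, the `temperature` parameter, and all function names below are assumptions made for illustration. The sketch composes each unseen-class visual prototype as a relatedness-weighted mixture of seen-class visual prototypes, then assigns a video embedding to the nearest composed prototype.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    x = np.asarray(x, dtype=float)
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def compose_unseen_prototypes(seen_prototypes, seen_label_emb,
                              unseen_label_emb, temperature=0.1):
    """Build unseen-class visual prototypes from seen-class ones.

    Assumed scheme (hypothetical, not taken from the paper):
    relatedness = cosine similarity between label embeddings,
    turned into mixture weights with a temperature softmax.

    seen_prototypes:  (S, D) visual prototypes of seen classes
    seen_label_emb:   (S, E) text embeddings of seen class labels
    unseen_label_emb: (U, E) text embeddings of unseen class labels
    Returns (U, D) prototypes and the (U, S) mixture weights.
    """
    seen_prototypes = np.asarray(seen_prototypes, dtype=float)
    sim = l2_normalize(unseen_label_emb) @ l2_normalize(seen_label_emb).T
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ seen_prototypes, weights


def classify(video_emb, prototypes):
    """Assign each video embedding to the most cosine-similar prototype."""
    scores = l2_normalize(video_emb) @ l2_normalize(prototypes).T
    return scores.argmax(axis=1)


if __name__ == "__main__":
    # Toy data: 2 seen classes, 2 unseen classes, 3-D visual space.
    seen_protos = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    seen_labels = np.array([[1.0, 0.0], [0.0, 1.0]])
    unseen_labels = np.array([[1.0, 0.1], [0.1, 1.0]])
    protos, w = compose_unseen_prototypes(seen_protos, seen_labels, unseen_labels)
    videos = np.array([[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]])
    print(classify(videos, protos))
```

Because the composed prototypes live in the visual space rather than the label space, inference compares visual features against visual features, which matches the abstract's stated goal of preserving discriminative visual structure. The specific weighting used here is only one plausible choice.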

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning