Cross-modal Representation Learning for Zero-shot Action Recognition
Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu

TL;DR
This paper introduces a cross-modal Transformer framework that jointly encodes video and text for zero-shot action recognition, improving accuracy by learning shared visual-semantic representations without extra pre-training.
Contribution
It proposes a novel end-to-end pipeline for learning discriminative visual and semantic features in a shared space, enhancing zero-shot recognition performance.
Findings
Significantly improves ZSAR accuracy over prior state-of-the-art methods.
Effective semantic transfer scheme for unseen classes.
No additional dataset pre-training required.
Abstract
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, whereby it encourages the learned visual embedding to be discriminative and more semantically consistent. In zero-shot inference, we devise a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to composite unseen visual prototypes. Accordingly, the discriminative features in the visual structure could be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic…
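The semantic transfer idea in the abstract (compositing unseen visual prototypes from semantic relatedness between seen and unseen classes) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the softmax weighting, and the `temperature` parameter are all assumptions made for illustration, and the toy embeddings stand in for outputs of a trained video and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def composite_unseen_prototypes(seen_protos, seen_text, unseen_text, temperature=0.1):
    """Build one visual prototype per unseen class (hypothetical scheme).

    Each unseen prototype is a weighted sum of seen-class visual prototypes,
    with weights derived from text-embedding similarity between the unseen
    label and every seen label. The weighting rule is an assumption.
    """
    dim = len(seen_protos[0])
    prototypes = []
    for text_vec in unseen_text:
        relatedness = [cosine(text_vec, s) for s in seen_text]
        weights = softmax(relatedness, temperature)
        proto = [sum(w * p[d] for w, p in zip(weights, seen_protos)) for d in range(dim)]
        prototypes.append(proto)
    return prototypes

def classify(video_embedding, prototypes):
    """Return the index of the prototype most similar to the video embedding."""
    sims = [cosine(video_embedding, p) for p in prototypes]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy example: two seen classes, two unseen classes, 2-D embeddings.
seen_protos = [[1.0, 0.0], [0.0, 1.0]]    # visual prototypes of seen classes
seen_text = [[1.0, 0.0], [0.0, 1.0]]      # text embeddings of seen labels
unseen_text = [[0.9, 0.1], [0.1, 0.9]]    # text embeddings of unseen labels

unseen_protos = composite_unseen_prototypes(seen_protos, seen_text, unseen_text)
predicted = classify([0.8, 0.2], unseen_protos)
```

Because the unseen prototypes live in the visual embedding space rather than the text space, classification compares a video embedding against visual prototypes directly, which is one way to read the abstract's claim that discriminative visual structure is preserved.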
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
