Learning Spatiotemporal Features via Video and Text Pair Discrimination
Tianhao Li, Limin Wang

TL;DR
This paper introduces a weakly supervised learning framework that leverages video-text pairs to learn spatiotemporal features, achieving competitive action classification results on Kinetics and state-of-the-art zero-shot action recognition on UCF101 without extensive manual annotations.
Contribution
The paper proposes a cross-modal pair discrimination (CPD) framework that uses noise-contrastive estimation and a curriculum learning strategy to learn visual-textual features efficiently.
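To make the pair discrimination idea concrete, here is a minimal sketch under stated assumptions. It is not the authors' implementation: the paper uses noise-contrastive estimation to cope with the very large number of pair instance classes, whereas this sketch swaps in a simpler in-batch softmax in which each video must pick out its own text among the other texts in the batch. The function names, the temperature value, and the toy embeddings are all hypothetical and chosen only for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (leave all-zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """Mean cross-entropy of matching video i to text i within one batch.

    Assumption: an in-batch softmax stands in for the noise-contrastive
    estimation described in the paper; temperature is an illustrative value.
    """
    videos = [l2_normalize(v) for v in video_emb]
    texts = [l2_normalize(t) for t in text_emb]
    total = 0.0
    for i, v in enumerate(videos):
        # Cosine similarity of this video to every text in the batch.
        logits = [dot(v, t) / temperature for t in texts]
        # Numerically stable log-sum-exp over all candidate texts.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability assigned to the true pair.
        total += log_denom - logits[i]
    return total / len(videos)


# Toy check with hypothetical embeddings: correctly paired inputs should
# score a lower loss than the same inputs with the texts rotated by one.
random.seed(0)
video = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
aligned = pair_discrimination_loss(video, video)
mismatched = pair_discrimination_loss(video, video[1:] + video[:1])
print(aligned < mismatched)
```

In this toy setup the text embeddings are simply copies of the video embeddings, so the aligned case is the best achievable pairing. In a real system the two embeddings would come from separate video and text encoders trained jointly to lower this kind of loss.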
Findings
Achieves competitive action classification results on Kinetics without fine-tuning.
Provides effective initialization for downstream action recognition tasks.
Sets new state-of-the-art in zero-shot action recognition on UCF101.
Abstract
Current video representations heavily rely on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture this correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational issue imposed by the huge amount of pair instance classes and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
