Learning Spatiotemporal Features via Video and Text Pair Discrimination

Tianhao Li; Limin Wang

arXiv:2001.05691 · cs.CV · January 29, 2021 · 33 citations

TL;DR

This paper introduces a weakly-supervised framework that learns spatiotemporal features from video-text pairs. It reports competitive action classification on Kinetics without fine-tuning and state-of-the-art zero-shot action recognition on UCF101, without relying on extensive manual annotation.

Contribution

The paper proposes a cross-modal pair discrimination (CPD) framework that uses noise-contrastive estimation and curriculum learning to learn visual-textual features efficiently.

Findings

01 Achieves competitive action classification results on Kinetics without fine-tuning.

02 Provides an effective initialization for downstream action recognition tasks.

03 Sets a new state of the art in zero-shot action recognition on UCF101.
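
The page does not say how the zero-shot evaluation is carried out. One common way to do zero-shot classification with a joint video-text embedding is to embed each class name as text and pick the class whose embedding is closest to the video embedding. The sketch below illustrates that idea only. The function names, the toy 3-dimensional embeddings, and the use of cosine similarity are assumptions for illustration, not the authors' confirmed procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(video_embedding, class_text_embeddings):
    """Return the class name whose text embedding is most similar to the video.

    class_text_embeddings is a hypothetical dict: class name -> text embedding.
    In practice both embeddings would come from trained video and text encoders.
    """
    return max(
        class_text_embeddings,
        key=lambda name: cosine(video_embedding, class_text_embeddings[name]),
    )


# Toy embeddings standing in for encoder outputs (made up for illustration).
toy_classes = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
}
toy_video = [0.9, 0.1, 0.0]
predicted = zero_shot_classify(toy_video, toy_classes)
```

No labeled videos of the target classes are needed here: only the class names have to be embedded, which is why a shared video-text space lends itself to this setting.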

Abstract

Current video representations heavily rely on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture this correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational issue imposed by the huge number of pair instance classes and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness. Without further…
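
To make the role of noise-contrastive estimation concrete: treating every video-text pair as its own class would require a softmax over all pairs, which is expensive when there are hundreds of thousands of them. NCE replaces this with a binary decision between the true text and a handful of sampled noise texts. The following is a minimal sketch of a generic NCE objective under stated assumptions: a uniform noise distribution over `num_pairs` instances, a fixed normalising constant `z`, and a temperature `tau`. All names and default values are hypothetical, and the exact formulation used in the paper may differ.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def nce_pair_loss(video, pos_text, neg_texts, num_pairs, tau=0.07, z=1.0):
    """Binary NCE loss for one video, its paired text, and m noise texts.

    Assumptions (not taken from the source): embeddings are L2-normalised,
    the noise distribution is uniform (1 / num_pairs), and z is a constant.
    """
    m = len(neg_texts)
    noise_mass = m * (1.0 / num_pairs)

    def probs(text):
        # Unnormalised model probability that `text` belongs to `video`.
        p_model = math.exp(dot(video, text) / tau) / z
        total = p_model + noise_mass
        # (probability the sample is real data, probability it is noise)
        return p_model / total, noise_mass / total

    p_data, _ = probs(pos_text)
    loss = -math.log(p_data)          # true pair should look like data
    for text in neg_texts:
        _, p_noise = probs(text)
        loss -= math.log(p_noise)     # sampled texts should look like noise
    return loss


# Toy 2-D embeddings (made up): a matched pair versus a mismatched one.
video = [1.0, 0.0]
matched = nce_pair_loss(video, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]], num_pairs=1000)
mismatched = nce_pair_loss(video, [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], num_pairs=1000)
```

The cost per video scales with the number of sampled noise texts rather than with the total number of pairs, which is the computational benefit the abstract refers to.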

Code & Models

Repositories

MCG-NJU/CPD-Video (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning