Temporal Contrastive Learning with Curriculum
Shuvendu Roy, Ali Etemad

TL;DR
ConCur introduces a curriculum-based contrastive learning approach for videos, progressively increasing difficulty in positive sample selection to improve action recognition and retrieval performance.
Contribution
The paper proposes a curriculum learning strategy for contrastive video representation learning, together with an auxiliary task of predicting the temporal distance between a positive pair of clips.
Findings
ConCur achieves state-of-the-art results on UCF101 and HMDB51 for video action recognition and video retrieval.
The method is effective across different encoder backbones and pre-training datasets.
Component ablations confirm the importance of both the curriculum and the auxiliary task.
Abstract
We present ConCur, a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy in contrastive training. More specifically, ConCur starts the contrastive training with easy positive samples (temporally close and semantically similar clips), and as the training progresses, it increases the temporal span, effectively sampling hard positives (temporally distant and semantically dissimilar clips). To learn better context-aware representations, we also propose an auxiliary task of predicting the temporal distance between a positive pair of clips. We conduct extensive experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance on two benchmark tasks: video action recognition and video retrieval. We explore the impact of encoder backbones and pre-training…
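The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear schedule, the frame-based span, and the names `curriculum_span` and `sample_positive_pair` are assumptions made for illustration, since this summary does not give the actual schedule or how the temporal distance is encoded.

```python
import random


def curriculum_span(step, total_steps, min_span, max_span):
    """Allowed temporal gap (in frames) at a given training step.

    Assumption: the span grows linearly from min_span to max_span.
    The schedule used in the paper may differ.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return int(round(min_span + progress * (max_span - min_span)))


def sample_positive_pair(num_frames, clip_len, span, rng):
    """Sample start frames of two clips from one video, at most `span` apart.

    Returns (anchor_start, positive_start, distance). The distance could
    serve as the target of an auxiliary temporal-distance prediction task.
    """
    last_start = num_frames - clip_len
    if last_start < 0:
        raise ValueError("video is shorter than the clip length")
    anchor = rng.randint(0, last_start)
    low = max(0, anchor - span)
    high = min(last_start, anchor + span)
    positive = rng.randint(low, high)
    return anchor, positive, abs(positive - anchor)


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        span = curriculum_span(step, total_steps=100, min_span=4, max_span=64)
        print(step, span, sample_positive_pair(300, 16, span, rng))
```

Early in training the two clips are forced to lie close together (easy positives); as the span widens, pairs that are far apart in time, and therefore likely to look less alike, become possible (hard positives).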
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Batch Normalization · Dense Connections · (2+1)D Convolution · Average Pooling · Global Average Pooling · Residual Connection · R(2+1)D
