TL;DR
This paper introduces a novel temporal contrastive learning framework for videos, employing two new loss functions to enhance temporal feature diversity and improve performance on various video understanding tasks.
Contribution
It proposes a new temporal contrastive learning framework with two innovative loss functions to explicitly encourage temporal feature distinction in self-supervised video representation learning.
Findings
Achieves 82.4% top-1 accuracy on UCF101 with 3D ResNet-18
Improves nearest-neighbor video retrieval accuracy by 11.7% on UCF101
Significantly outperforms previous methods on multiple video understanding benchmarks
Abstract
Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations, and has also been explored for videos. However, prior work on contrastive learning for video data has not explored the effect of explicitly encouraging the features to be distinct across the temporal dimension. We develop a new temporal contrastive learning framework consisting of two novel losses to improve upon existing contrastive self-supervised video representation learning methods. The local-local temporal contrastive loss adds the task of discriminating between non-overlapping clips from the same video, whereas the global-local temporal contrastive loss aims to discriminate between timesteps of the feature map of an input clip in order to increase the temporal diversity of the learned features. Our proposed temporal contrastive learning framework achieves…
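To make the two losses described in the abstract more concrete, here is a minimal NumPy sketch of an InfoNCE-style objective computed over timesteps. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `temporal_info_nce`, the temperature value, the cosine-similarity choice, and the use of only same-video timesteps as negatives are all hypothetical choices made for this example.

```python
import numpy as np


def temporal_info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over timesteps (illustrative, not the authors' code).

    anchors, positives: arrays of shape (T, D). Row t of `anchors` should
    match row t of `positives`; the other T - 1 rows of `positives`
    (other timesteps of the same video) act as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / temperature                 # shape (T, T)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


# Toy usage with random features standing in for encoder outputs (assumption).
rng = np.random.default_rng(0)
clip_feats = rng.normal(size=(4, 16))   # 4 timesteps, 16-dim features
loss_aligned = temporal_info_nce(clip_feats, clip_feats)
loss_misaligned = temporal_info_nce(clip_feats, clip_feats[::-1])
print(loss_aligned, loss_misaligned)
```

Under this sketch, a local-local style loss would pass features of T non-overlapping clips from one augmented view as `anchors` and the same clips from a second augmented view as `positives`. A global-local style loss would pass the T timesteps of a longer clip's feature map as `anchors` and the features of the T shorter clips covering the same time spans as `positives`. In both cases the loss is low only when features at different timesteps are distinguishable, which is the temporal diversity the abstract refers to.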
Taxonomy
Methods: Contrastive Learning
