Can Temporal Information Help with Contrastive Self-Supervised Learning?

Yutong Bai; Haoqi Fan; Ishan Misra; Ganesh Venkatesh; Yongyi Lu; Yuyin Zhou; Qihang Yu; Vikas Chandra; Alan Yuille

arXiv:2011.13046·cs.CV·November 30, 2020·29 cites

TL;DR

This paper introduces TaCo, a novel framework that leverages carefully selected temporal transformations as both data augmentation and self-supervision signals to improve contrastive self-supervised learning for video understanding.

Contribution

The paper proposes TaCo, a general paradigm that integrates temporal information into video contrastive self-supervised learning (CSL) by using temporal transformations both as data augmentation and as an extra self-supervision signal, and reports gains over prior methods on UCF-101 and HMDB-51.

Findings

01 TaCo outperforms previous state-of-the-art methods on the UCF-101 and HMDB-51 datasets.

02 Using temporal transformations as a self-supervision signal improves video representation learning.

03 Directly applying temporal augmentations does not help video CSL and can even impair it, which motivates the redesigned framework.

Abstract

Leveraging temporal information has been regarded as essential for developing video understanding models. However, how to properly incorporate temporal information into the recently successful instance-discrimination-based contrastive self-supervised learning (CSL) framework remains unclear. We find that the intuitive solution of directly applying temporal augmentations does not help, and can even impair, video CSL in general. This counter-intuitive observation motivates us to redesign existing video CSL frameworks for better integration of temporal knowledge. To this end, we present Temporal-aware Contrastive self-supervised learning (TaCo) as a general paradigm to enhance video CSL. Specifically, TaCo selects a set of temporal transformations not only as strong data augmentation but also to constitute extra self-supervision for video understanding. By jointly contrasting instances with…
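The idea of using one temporal transformation both as augmentation and as a supervision signal can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not list which transformations TaCo selects, so the set below (`identity`, `reverse`, `shuffle`, `speed_2x`) and the helper names `apply_temporal_transform` and `make_training_view` are assumptions made for illustration. Each call returns an augmented clip together with the index of the transformation applied, which could serve as the target of an auxiliary classification task alongside the contrastive objective.

```python
import random

# Hypothetical transformation set, chosen only for illustration; the abstract
# does not say which temporal transformations TaCo actually selects.
TRANSFORMS = ("identity", "reverse", "shuffle", "speed_2x")


def apply_temporal_transform(frames, name, rng):
    """Return a temporally transformed copy of a list of frames."""
    if name == "identity":
        return list(frames)
    if name == "reverse":
        return list(reversed(frames))
    if name == "shuffle":
        shuffled = list(frames)
        rng.shuffle(shuffled)  # in-place shuffle of the copy
        return shuffled
    if name == "speed_2x":
        return list(frames[::2])  # keep every second frame
    raise ValueError(f"unknown transform: {name}")


def make_training_view(frames, rng):
    """Pick one transformation at random and return (view, label).

    view  : the augmented clip (the data-augmentation role)
    label : index into TRANSFORMS (the self-supervision role), usable as the
            target of an auxiliary "which transformation was applied?" task
    """
    label = rng.randrange(len(TRANSFORMS))
    view = apply_temporal_transform(frames, TRANSFORMS[label], rng)
    return view, label


rng = random.Random(0)
clip = list(range(8))  # stand-in for 8 video frames
view, label = make_training_view(clip, rng)
```

In a full training loop, `view` would be fed to the video encoder for the contrastive loss while `label` would supervise a small classification head; how the two losses are combined is not recoverable from the truncated abstract and is left out here.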

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Circular Smooth Label