Composable Augmentation Encoding for Video Representation Learning

Chen Sun; Arsha Nagrani; Yonglong Tian; Cordelia Schmid

arXiv:2104.00616 · cs.CV · August 23, 2021

Open Access

TL;DR

This paper introduces a novel contrastive learning framework for videos that explicitly encodes augmentation parameters, improving the representation's ability to capture temporal and spatial information and enhancing performance on video benchmarks.

Contribution

It proposes Composable Augmentation Encoding (CATE), a method that explicitly incorporates augmentation parameters into contrastive learning for better video representations.

Findings

01 Achieves state-of-the-art results on multiple video benchmarks.

02 Encodes valuable spatial and temporal augmentation information.

03 Improves downstream task performance by capturing invariances.

Abstract

We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (e.g., fine-grained video action recognition, which would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video…
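
To make the "augmentation aware" idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the sinusoidal encoding of the time shift, the single linear projection head, the concatenation of feature and encoding, and every name below (`sinusoidal_encoding`, `project`, `info_nce`) are assumptions chosen for illustration. It only shows the general pattern the abstract describes: an augmentation parameter is encoded and supplied to the projection step, and the projected view is then scored against the other view with a standard InfoNCE contrastive loss.

```python
import math
import random


def sinusoidal_encoding(value, dim=4):
    """Encode a scalar augmentation parameter (e.g. a time shift) as a vector.

    Assumption: a sinusoidal code is used here only for illustration.
    """
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10.0 ** i)
        enc.extend([math.sin(value * freq), math.cos(value * freq)])
    return enc


def project(feature, aug_code, weights):
    """Hypothetical linear head applied to the concatenation [feature ; aug_code]."""
    x = feature + aug_code  # list concatenation, not element-wise addition
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss: keys[i] is the positive for queries[i], the rest are negatives."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)


# Toy data: random vectors stand in for encoder features of two views per video.
random.seed(0)
feat_dim, code_dim = 6, 4
weights = [[random.gauss(0, 0.3) for _ in range(feat_dim + code_dim)]
           for _ in range(feat_dim)]
view_a = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(3)]
view_b = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(3)]
time_shifts = [0.5, 2.0, -1.0]  # hypothetical shift between view A and view B

# The projection of view A is conditioned on how view B was produced from it.
queries = [normalize(project(f, sinusoidal_encoding(s, code_dim), weights))
           for f, s in zip(view_a, time_shifts)]
keys = [normalize(f) for f in view_b]
loss = info_nce(queries, keys)
print(f"InfoNCE loss on toy data: {loss:.4f}")
```

Because the head sees the shift value, it does not need to map temporally shifted views to identical embeddings, which is the invariance the abstract identifies as harmful for temporally sensitive tasks.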

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Contrastive Learning