Composable Augmentation Encoding for Video Representation Learning
Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid

TL;DR
This paper introduces a novel contrastive learning framework for videos that explicitly encodes augmentation parameters, improving the representation's ability to capture temporal and spatial information and enhancing performance on video benchmarks.
Contribution
It proposes Composable Augmentation Encoding (CATE), a method that explicitly incorporates augmentation parameters into contrastive learning for better video representations.
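As a rough illustration of what "explicitly incorporating augmentation parameters" could look like, the sketch below turns a list of augmentations into a sequence of small vectors, one per augmentation. The augmentation names, the `Augmentation` class, and the one-hot-plus-value layout are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical augmentation vocabulary; the paper's actual set may differ.
AUG_TYPES = ("time_shift", "crop_x", "crop_y")


@dataclass(frozen=True)
class Augmentation:
    kind: str     # which augmentation was applied
    value: float  # its parameter, e.g. the time shift in seconds


def encode_augmentation(aug: Augmentation) -> List[float]:
    """Encode one augmentation as a one-hot type indicator plus its value."""
    if aug.kind not in AUG_TYPES:
        raise ValueError(f"unknown augmentation: {aug.kind}")
    one_hot = [1.0 if aug.kind == k else 0.0 for k in AUG_TYPES]
    return one_hot + [aug.value]


def encode_sequence(augs: Sequence[Augmentation]) -> List[List[float]]:
    """Compose several augmentations by encoding them as an ordered sequence."""
    return [encode_augmentation(a) for a in augs]


# A view created with a temporal shift followed by a horizontal crop offset.
view_augs = [Augmentation("time_shift", 2.5), Augmentation("crop_x", 0.1)]
tokens = encode_sequence(view_augs)
```

Because each augmentation becomes its own element of the sequence, adding or dropping an augmentation only changes the sequence length, which is one plausible reading of "composable" here.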
Findings
Achieves state-of-the-art results on multiple video benchmarks.
Encodes valuable spatial and temporal augmentation information.
Improves performance on downstream tasks that violate the invariances assumed by standard contrastive learning.
Abstract
We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives. These methods implicitly assume a set of representational invariances to the view selection mechanism (e.g., sampling frames with temporal shifts), which may lead to poor performance on downstream tasks that violate these invariances (e.g., fine-grained video action recognition that would benefit from temporal information). To overcome this limitation, we propose an 'augmentation aware' contrastive learning framework, where we explicitly provide a sequence of augmentation parameterisations (such as the values of the time shifts used to create data views) as composable augmentation encodings (CATE) to our model when projecting the video…
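The contrastive paradigm described in the abstract (views of the same instance as positives, other instances as negatives) is commonly trained with an InfoNCE-style loss. The sketch below is a minimal pure-Python version of that loss under assumptions of this example: the `condition_on_augmentation` helper is a toy stand-in that simply concatenates a feature with an augmentation code, whereas the paper's head is a learned model; the feature values and the time-shift codes are made up.

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(queries: List[Vector], keys: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss: keys[i] is the positive for queries[i],
    and every other key in the batch acts as a negative."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def condition_on_augmentation(feature: Vector, aug_code: Vector) -> Vector:
    """Toy stand-in (assumption) for an augmentation-aware projection:
    concatenate the view feature with its augmentation code."""
    return feature + aug_code


# Two videos, two views each (made-up features).
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
time_shift_code = [[0.5], [0.5]]  # hypothetical encoded time shift per pair

queries = [condition_on_augmentation(f, c) for f, c in zip(view_a, time_shift_code)]
keys = [condition_on_augmentation(f, c) for f, c in zip(view_b, time_shift_code)]

aligned_loss = info_nce(queries, keys)         # positives correctly paired
shuffled_loss = info_nce(queries, keys[::-1])  # positives deliberately mismatched
```

With correctly paired views the loss is close to zero, while mismatching the pairs makes it much larger, which is the behaviour the contrastive objective relies on.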
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Contrastive Learning
