SCVRL: Shuffled Contrastive Video Representation Learning

Michael Dorkenwald; Fanyi Xiao; Biagio Brattoli; Joseph Tighe; Davide Modolo

arXiv:2205.11710 · cs.CV · May 25, 2022

Open Access

TL;DR

SCVRL is a contrastive learning framework that captures both semantic and motion patterns in videos. Using a transformer-based network, it outperforms CVRL on four benchmarks.

Contribution

It reformulates the shuffling pretext task within a contrastive learning paradigm and demonstrates the effectiveness of transformers in learning motion in self-supervised video representations.

Findings

01. Outperforms CVRL on four benchmarks

02. Capable of learning both semantic and motion patterns

03. Uses a transformer-based network for video representation

Abstract

We propose SCVRL, a novel contrastive framework for self-supervised learning on videos. Unlike previous contrastive-learning-based methods, which mostly focus on learning visual semantics (e.g., CVRL), SCVRL is capable of learning both semantic and motion patterns. To that end, we reformulate the popular shuffling pretext task within a modern contrastive learning paradigm. We show that our transformer-based network has a natural capacity to learn motion in self-supervised settings and achieves strong performance, outperforming CVRL on four benchmarks.
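One way to picture a shuffling task inside a contrastive objective is sketched below. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not give the loss, so the sketch assumes an InfoNCE-style loss in which a temporally shuffled copy of the clip acts as a negative. The names `shuffle_frames`, `info_nce`, and `toy_encoder` are hypothetical, and the toy encoder merely stands in for the transformer-based network.

```python
import math
import random


def shuffle_frames(frames, rng):
    """Return the frames in a random order that differs from the original order."""
    order = list(range(len(frames)))
    if len(order) < 2:
        return list(frames)
    shuffled = order[:]
    while shuffled == order:
        rng.shuffle(shuffled)
    return [frames[i] for i in shuffled]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and a list of negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def toy_encoder(frames):
    """Hypothetical order-sensitive encoder: position-weighted sum of frame features.

    Assumes at least two frames, each a list of floats of equal length.
    """
    t = len(frames)
    dim = len(frames[0])
    weights = [2.0 * i / (t - 1) - 1.0 for i in range(t)]
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy data: 8 frames, each a 4-dimensional feature vector.
    clip = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
    # A lightly perturbed copy plays the role of an augmented positive view.
    view = [[x + 0.01 * rng.gauss(0, 1) for x in f] for f in clip]
    # The shuffled clip shares appearance with the anchor but not temporal order.
    negative = toy_encoder(shuffle_frames(clip, rng))
    print(info_nce(toy_encoder(clip), toy_encoder(view), [negative]))
```

Because the shuffled clip contains the same frames as the anchor, appearance alone cannot separate the two; in this sketch the loss can only be driven down by an encoder that is sensitive to frame order, which is the intuition behind pairing shuffling with a contrastive loss.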

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging

Methods: Dense Connections · Temporally Consistent Spatial Augmentation · 3D Convolution · Contrastive Learning · Contrastive Video Representation Learning