Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning

Mandela Patrick; Yuki M. Asano; Bernie Huang; Ishan Misra; Florian Metze; Joao Henriques; Andrea Vedaldi

arXiv:2103.10211·cs.CV·October 28, 2021

TL;DR

This paper introduces Space-Time Crop & Attend (STiCA), a novel method that enhances self-supervised video representation learning by efficiently applying spatial augmentations and transformer-based attention, achieving state-of-the-art results.

Contribution

The paper proposes Feature Crop for efficient spatial augmentation simulation and demonstrates the effectiveness of transformer-based attention in video representation learning.
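One way to picture Feature Crop is as cropping applied to the backbone's output feature map rather than to the input pixels, so that several crops share a single forward pass. The following is a minimal NumPy sketch under that assumption, not the authors' implementation: the function name `feature_crop`, the `(C, T, H, W)` layout, the crop sizes, and the mean pooling are all hypothetical choices made for illustration.

```python
import numpy as np

def feature_crop(features, crop_h, crop_w, num_crops, rng):
    """Take random spatial crops from a (C, T, H, W) feature map.

    Hypothetical helper: the tensor layout and crop sizes are assumptions.
    """
    c, t, h, w = features.shape
    if crop_h > h or crop_w > w:
        raise ValueError("crop must not exceed the feature map size")
    crops = []
    for _ in range(num_crops):
        # Sample the top-left corner so the crop stays inside the map.
        top = int(rng.integers(0, h - crop_h + 1))
        left = int(rng.integers(0, w - crop_w + 1))
        crops.append(features[:, :, top:top + crop_h, left:left + crop_w])
    return np.stack(crops)  # (num_crops, C, T, crop_h, crop_w)

rng = np.random.default_rng(0)
# Stand-in for one clip's backbone output (random values, illustration only).
features = rng.standard_normal((8, 4, 7, 7))
crops = feature_crop(features, crop_h=5, crop_w=5, num_crops=3, rng=rng)
# Average-pool each crop over time and space to get one vector per crop.
pooled = crops.mean(axis=(2, 3, 4))  # (num_crops, C)
```

Because the backbone runs once per clip and only the cheap slicing is repeated, adding more crops costs little extra compute or memory in this sketch.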

Findings

1. Feature Crop enables scalable spatial augmentation in videos.

2. Transformer attention significantly improves video feature pooling.

3. STiCA achieves new state-of-the-art accuracy on benchmark datasets.
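The second finding, attention-based pooling of video features, can be illustrated with a single self-attention step over space-time tokens followed by averaging. This is a simplified sketch, not the paper's transformer: the single head, the random projection matrices `w_q`, `w_k`, `w_v`, the embedding size, and the final mean are assumptions chosen to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_q, w_k, w_v):
    """Pool a (C, T, H, W) feature map with one self-attention step.

    Hypothetical helper: single head, no positional encoding or MLP.
    """
    c = features.shape[0]
    # Flatten time and space into N = T*H*W tokens of dimension C.
    tokens = features.reshape(c, -1).T            # (N, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention weights between all token pairs.
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, N)
    attended = weights @ v                        # (N, D)
    # Average the attended tokens into a single clip-level vector.
    return attended.mean(axis=0), weights

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 4, 7, 7))      # stand-in backbone output
w_q, w_k, w_v = (rng.standard_normal((8, 16)) for _ in range(3))
pooled_vec, weights = attention_pool(features, w_q, w_k, w_v)
```

Compared with plain average pooling, each token's contribution here depends on its similarity to the other tokens, which lets the pooled vector weight different space-time locations unequally.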

Abstract

The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more…

Code & Models

Repositories

facebookresearch/GDT (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition