Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick, Yuki M. Asano, Bernie Huang, Ishan Misra, Florian Metze, Joao Henriques, Andrea Vedaldi

TL;DR
This paper introduces Space-Time Crop & Attend (STiCA), a novel method that enhances self-supervised video representation learning by efficiently applying spatial augmentations and transformer-based attention, achieving state-of-the-art results.
Contribution
The paper proposes Feature Crop for efficient spatial augmentation simulation and demonstrates the effectiveness of transformer-based attention in video representation learning.
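As a rough illustration of the Feature Crop idea, the sketch below takes several random spatial crops from one already-computed feature map instead of re-encoding several cropped input clips. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `feature_crop` and `make_views`, the nested-list layout `fmap[c][t][y][x]`, and the crop sizes are all hypothetical choices made for this example.

```python
import random

def feature_crop(fmap, crop_h, crop_w, rng=random):
    """Cut one random spatial window out of a feature map.

    fmap is a nested list indexed as fmap[c][t][y][x]
    (channels, time, height, width). This layout is an assumption
    made for illustration; channel and time axes are left untouched.
    """
    height = len(fmap[0][0])
    width = len(fmap[0][0][0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the feature map")
    top = rng.randint(0, height - crop_h)   # inclusive bounds
    left = rng.randint(0, width - crop_w)
    return [
        [[row[left:left + crop_w] for row in frame[top:top + crop_h]]
         for frame in channel]
        for channel in fmap
    ]

def make_views(fmap, n_views, crop_h, crop_w, seed=0):
    """Produce several cropped views from a single encoder output."""
    rng = random.Random(seed)
    return [feature_crop(fmap, crop_h, crop_w, rng) for _ in range(n_views)]

# Toy feature map: 2 channels, 3 time steps, 4x4 spatial grid.
# Each value encodes its own position so crops are easy to inspect.
C, T, H, W = 2, 3, 4, 4
fmap = [[[[c * 1000 + t * 100 + y * W + x for x in range(W)]
          for y in range(H)]
         for t in range(T)]
        for c in range(C)]

views = make_views(fmap, n_views=4, crop_h=2, crop_w=2)
```

The point of the design is that the encoder runs once per clip, while the comparatively cheap slicing step runs once per view.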
Findings
Feature Crop enables scalable spatial augmentation in videos.
Transformer attention significantly improves video feature pooling.
STiCA achieves new state-of-the-art accuracy on benchmark datasets.
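To make the attention-pooling finding concrete, the sketch below contrasts plain average pooling with pooling after one self-attention step over space-time feature tokens. It is a simplified sketch, not the paper's architecture: it assumes a single head with identity query, key, and value projections (a real transformer layer learns these and adds further components), and the names `self_attention`, `mean_pool`, and `attention_pool` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for illustration: identity projections, so each token
    serves as its own query, key, and value.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out

def mean_pool(tokens):
    """Average the tokens into one feature vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]

def attention_pool(tokens):
    """Let tokens exchange information, then average them."""
    return mean_pool(self_attention(tokens))

# Three toy space-time tokens with two feature dimensions each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = attention_pool(tokens)
baseline = mean_pool(tokens)
```

Here `baseline` weights every token equally, while `pooled` first re-weights each token by its similarity to the others, which is the general mechanism the finding refers to.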
Abstract
The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers have ported these methods from still images to videos and found that leveraging both audio and video signals yields strong gains; however, they did not find that spatial augmentations such as cropping, which are very important for still images, work as well for videos. In this paper, we improve these formulations in two ways unique to the spatio-temporal aspect of videos. First, for space, we show that spatial augmentations such as cropping do work well for videos too, but that previous implementations, due to the high processing and memory cost, could not do this at a scale sufficient for it to work well. To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
