The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Haider Al-Tahan; Yalda Mohsenzadeh

arXiv:2110.07082·cs.CV·October 15, 2021

Open Access

TL;DR

This paper investigates how spatiotemporal augmentations affect self-supervised audiovisual learning. It finds that lossy transformations that preserve the temporal coherency of videos are the most effective, and that their benefit grows with higher temporal resolution and stronger transformations, across frameworks and datasets.

Contribution

It introduces effective spatiotemporal augmentation strategies for audiovisual self-supervised learning, demonstrating their scalability and compatibility across frameworks and datasets.

Findings

01

Lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are most effective.

02

Transformations' effectiveness increases with higher temporal resolution.

03

Pre-training with the proposed augmentations improves linear classifier performance by approximately 6.5%.

Abstract

Contrastive learning of auditory and visual perception has been extremely successful when investigated individually. However, there are still major questions on how we could integrate principles learned from both domains to attain effective audiovisual representations. In this paper, we present a contrastive framework to learn audiovisual representations from unlabeled videos. The type and strength of augmentations utilized during self-supervised pre-training play a crucial role for contrastive frameworks to work sufficiently. Hence, we extensively investigate composition of temporal augmentations suitable for learning audiovisual representations; we find lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective. Furthermore, we show that the effectiveness of these transformations scales with higher temporal resolution and stronger…
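To make the phrase "lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names `coherent_crop` and `two_views`, the nested-list video layout, and the choice of random cropping as the lossy transform are all assumptions made for illustration. The idea shown is that one crop window is sampled per view and applied to every frame, so pixels are discarded (lossy) while frame order and frame-to-frame motion are left intact.

```python
import random

def coherent_crop(video, crop_h, crop_w, rng):
    """Crop every frame with the SAME window (hypothetical helper).

    video: list of frames, each frame a list of rows, each row a list of pixels.
    Pixels outside the window are lost, but frame order is unchanged.
    """
    height, width = len(video[0]), len(video[0][0])
    top = rng.randint(0, height - crop_h)   # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [
        [row[left:left + crop_w] for row in frame[top:top + crop_h]]
        for frame in video
    ]

def two_views(video, crop_h, crop_w, seed=0):
    """Return two independently cropped views of one clip (assumed setup
    for a contrastive positive pair)."""
    rng = random.Random(seed)
    return (coherent_crop(video, crop_h, crop_w, rng),
            coherent_crop(video, crop_h, crop_w, rng))

# Toy clip: 4 frames of 6x6 pixels; the hundreds digit encodes the frame index.
clip = [[[t * 100 + r * 6 + c for c in range(6)] for r in range(6)]
        for t in range(4)]
view_a, view_b = two_views(clip, crop_h=4, crop_w=4)
```

By contrast, a transform that shuffled the frames or sampled a different window per frame would break the temporal coherency that the abstract identifies as important.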

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image Processing Techniques