The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning
Haider Al-Tahan, Yalda Mohsenzadeh

TL;DR
This paper investigates how spatiotemporal augmentations affect self-supervised audiovisual representation learning, finding that lossy transformations that preserve a video's temporal coherency are the most effective, and that their benefit grows with higher temporal resolution and stronger temporal transformations, across frameworks and datasets.
Contribution
It introduces effective spatiotemporal augmentation strategies for audiovisual self-supervised learning, demonstrating their scalability and compatibility across frameworks and datasets.
Findings
Lossy spatio-temporal transformations that preserve temporal coherency are the most effective.
Transformations' effectiveness increases with higher temporal resolution.
Pre-training with the proposed augmentations improves linear classifier performance by approximately 6.5%.
Abstract
Contrastive learning of auditory and visual perception has been extremely successful when investigated individually. However, there are still major questions on how we could integrate principles learned from both domains to attain effective audiovisual representations. In this paper, we present a contrastive framework to learn audiovisual representations from unlabeled videos. The type and strength of augmentations utilized during self-supervised pre-training play a crucial role for contrastive frameworks to work sufficiently. Hence, we extensively investigate composition of temporal augmentations suitable for learning audiovisual representations; we find lossy spatio-temporal transformations that do not corrupt the temporal coherency of videos are the most effective. Furthermore, we show that the effectiveness of these transformations scales with higher temporal resolution and stronger…
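As an illustration of what a lossy transformation that keeps temporal coherency could look like, the sketch below applies one random crop window to every frame of a clip. It discards pixels (lossy) but leaves frame order and frame-to-frame alignment untouched. This is an assumption made for illustration, not the authors' augmentation pipeline: the function names `coherent_random_crop` and `two_views`, the `(T, H, W, C)` array layout, and the choice of cropping are all hypothetical.

```python
import numpy as np


def coherent_random_crop(video, crop_h, crop_w, rng):
    """Crop all frames of `video` (shape T, H, W, C) with one shared window.

    Sampling the window once per clip, rather than once per frame, keeps
    the spatial content aligned over time. Illustrative only.
    """
    _, h, w, _ = video.shape
    if crop_h > h or crop_w > w:
        raise ValueError("crop size exceeds frame size")
    # Upper bound is exclusive, so the window always fits inside the frame.
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    return video[:, top:top + crop_h, left:left + crop_w, :]


def two_views(video, crop_h, crop_w, seed=0):
    """Return two independently cropped views of the same clip, as a
    contrastive framework would typically consume a positive pair."""
    rng = np.random.default_rng(seed)
    view_a = coherent_random_crop(video, crop_h, crop_w, rng)
    view_b = coherent_random_crop(video, crop_h, crop_w, rng)
    return view_a, view_b
```

Calling `two_views(clip, 112, 112)` on a clip of shape `(T, H, W, C)` would yield two views of shape `(T, 112, 112, C)`. A per-frame crop, by contrast, would shift content between consecutive frames and break the coherency the abstract describes.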
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Advanced Image Processing Techniques
