Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation

Jinxiang Liu; Chen Ju; Weidi Xie; Ya Zhang

arXiv:2206.12772·cs.CV·August 16, 2022

TL;DR

This paper introduces a self-supervised audio-visual learning framework that leverages transformation invariance and equivariance to improve sound source localization and retrieval tasks in videos.

Contribution

It systematically investigates the effects of data augmentations, emphasizing the importance of transformation invariance and equivariance for better multi-modal representations.

Findings

1. Outperforms previous methods on Flickr-SoundNet and VGG-Sound benchmarks.

2. Achieves competitive results with supervised methods in audio retrieval.

3. Enhances generalization in cross-modal retrieval tasks.

Abstract

We present a simple yet effective self-supervised framework for audio-visual representation learning, to localize the sound source in videos. To understand what enables the learning of useful representations, we systematically investigate the effects of data augmentations, and reveal that (1) the composition of data augmentations plays a critical role, i.e. explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e. the detected sound source should follow the same transformation applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely, Flickr-SoundNet and VGG-Sound.…
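The two properties named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the 1x1-convolution `toy_visual_encoder`, the cosine-similarity `localisation_map`, and every function name and shape below are assumptions made for illustration. The sketch only shows what is being measured: equivariance compares the map of a flipped frame with the flipped map of the original frame, and invariance compares the audio-visual score before and after an appearance change.

```python
import numpy as np

def localisation_map(visual_feats, audio_emb):
    """Cosine similarity between an audio embedding (C,) and every spatial
    position of a visual feature map (C, H, W). Returns an (H, W) map."""
    v = visual_feats / (np.linalg.norm(visual_feats, axis=0, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return np.einsum("chw,c->hw", v, a)

def hflip(x):
    """Horizontal flip along the last (width) axis."""
    return x[..., ::-1]

def toy_visual_encoder(frame, weight):
    """Hypothetical stand-in for a CNN: a 1x1 convolution (per-pixel linear map).
    frame: (3, H, W), weight: (C, 3) -> features of shape (C, H, W)."""
    return np.einsum("oc,chw->ohw", weight, frame)

def equivariance_loss(frame, audio_emb, weight):
    """Mean squared gap between map(flip(frame)) and flip(map(frame))."""
    map_of_flipped = localisation_map(
        toy_visual_encoder(hflip(frame), weight), audio_emb)
    flipped_map = hflip(localisation_map(
        toy_visual_encoder(frame, weight), audio_emb))
    return float(np.mean((map_of_flipped - flipped_map) ** 2))

def invariance_gap(frame, audio_emb, weight, gain=1.5):
    """Change in the max-pooled audio-visual score when the frame is rescaled
    (a crude brightness-like appearance augmentation)."""
    score = localisation_map(
        toy_visual_encoder(frame, weight), audio_emb).max()
    score_aug = localisation_map(
        toy_visual_encoder(gain * frame, weight), audio_emb).max()
    return float(abs(score - score_aug))

# Random stand-ins for a video frame, encoder weights, and an audio embedding.
rng = np.random.default_rng(0)
frame = rng.normal(size=(3, 8, 8))
weight = rng.normal(size=(16, 3))
audio_emb = rng.normal(size=16)

eq_loss = equivariance_loss(frame, audio_emb, weight)
inv_gap = invariance_gap(frame, audio_emb, weight)
```

The toy encoder satisfies both properties almost exactly by construction, so both quantities come out close to zero. With a real network neither holds automatically, which is why quantities of this kind would be encouraged as training objectives rather than assumed.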
