Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation
Jinxiang Liu, Chen Ju, Weidi Xie, Ya Zhang

TL;DR
This paper introduces a self-supervised audio-visual learning framework that leverages transformation invariance and equivariance to improve sound source localization and retrieval tasks in videos.
Contribution
It systematically investigates the effects of data augmentations, emphasizing the importance of transformation invariance and equivariance for better multi-modal representations.
Findings
Outperforms previous methods on Flickr-SoundNet and VGG-Sound benchmarks.
Achieves results competitive with supervised methods in audio retrieval.
Enhances generalization in cross-modal retrieval tasks.
Abstract
We present a simple yet effective self-supervised framework for audio-visual representation learning that localizes the sound source in videos. To understand what enables the learning of useful representations, we systematically investigate the effects of data augmentations and reveal that (1) the composition of data augmentations plays a critical role, i.e., explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); (2) enforcing geometric consistency substantially improves the quality of learned representations, i.e., the detected sound source should follow the same transformation applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound localization benchmarks, namely Flickr-SoundNet and VGG-Sound.…
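The equivariance property described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names (`localisation_map`, `hflip`, `equivariance_gap`), the cosine-similarity localiser, and the random arrays standing in for network features are all assumptions made for illustration. The flip is applied to a feature map rather than to raw video frames, so the toy localiser is equivariant by construction; in a trained network the same gap could plausibly serve as a consistency penalty.

```python
import numpy as np

def localisation_map(visual_feats, audio_emb):
    """Cosine similarity between an audio embedding of shape (C,)
    and every spatial location of a (C, H, W) visual feature map."""
    v = visual_feats / (np.linalg.norm(visual_feats, axis=0, keepdims=True) + 1e-8)
    a = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    return np.einsum("chw,c->hw", v, a)

def hflip(x):
    """Horizontal flip along the last (width) axis."""
    return x[..., ::-1]

def equivariance_gap(visual_feats, audio_emb, transform):
    """Mean squared difference between map(transform(input))
    and transform(map(input)); zero means perfect equivariance."""
    map_of_transformed = localisation_map(transform(visual_feats), audio_emb)
    transformed_map = transform(localisation_map(visual_feats, audio_emb))
    return float(np.mean((map_of_transformed - transformed_map) ** 2))

# Hypothetical stand-ins for encoder outputs (assumed shapes: C=8, H=4, W=6).
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4, 6))
audio = rng.normal(size=8)

gap = equivariance_gap(feats, audio, hflip)
print(f"equivariance gap under horizontal flip: {gap:.3e}")
```

Transformation invariance would be checked analogously, by comparing embeddings of an input and its augmented version and expecting them to match rather than to move with the transformation.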
