TL;DR
This paper introduces an Induction Network that effectively bridges the gap between audio and visual modalities for self-supervised sound source localization, improving alignment and robustness over previous contrastive learning methods.
Contribution
The proposed Induction Network decouples modality gradients and uses an induction vector to enhance cross-modal alignment, addressing heterogeneity issues in self-supervised learning.
Findings
Outperforms state-of-the-art methods on SoundNet-Flickr and VGG-Sound datasets.
Improves robustness with adaptive threshold selection.
Effectively aligns audio and visual modalities in challenging scenarios.
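As a rough illustration of what aligning the two modalities means for localization, methods in this family commonly score each spatial location of a visual feature map by its cosine similarity to a single audio embedding, then threshold the resulting map to obtain a predicted source region. The sketch below shows only that generic step. It is not the paper's Induction Network, and the function names, toy feature values, and the fixed threshold of 0.5 are all assumptions made for illustration (the paper's adaptive threshold selection is not reproduced here).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def localization_map(visual, audio):
    """Score every spatial cell of a visual feature map against one audio vector.

    visual: H x W nested list of D-dimensional feature vectors (hypothetical layout)
    audio:  one D-dimensional audio embedding
    Returns an H x W nested list of cosine similarities in [-1, 1].
    """
    return [[cosine(cell, audio) for cell in row] for row in visual]


def binarize(sim_map, threshold):
    """Mark cells whose similarity reaches the threshold as sound source (1)."""
    return [[1 if s >= threshold else 0 for s in row] for row in sim_map]


# Toy 2 x 2 feature map with 2-dimensional features (made-up values).
visual = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
audio = [1.0, 0.0]

sim = localization_map(visual, audio)
mask = binarize(sim, threshold=0.5)  # fixed threshold, chosen only for this example
```

Here the cells whose features point in roughly the same direction as the audio embedding are kept in `mask`, while orthogonal or opposite cells are dropped. In a real system the features would come from learned audio and visual encoders rather than hand-written lists.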
Abstract
Self-supervised sound source localization is usually challenged by modality inconsistency. In recent studies, contrastive learning based strategies have shown promise in establishing a consistent correspondence between audio and sound sources in visual scenes. Unfortunately, insufficient attention to the heterogeneity of the different modality features still limits further improvement of this scheme, which motivates our work. In this study, an Induction Network is proposed to bridge the modality gap more effectively. By decoupling the gradients of the visual and audio modalities, discriminative visual representations of sound sources can be learned with the designed Induction Vector in a bootstrap manner, which also enables the audio modality to be aligned with the visual modality consistently. In addition to a visual weighted contrastive…
Taxonomy
Methods: Contrastive Learning
