Learning Sound Localization Better From Semantically Similar Samples
Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon

TL;DR
This paper improves sound source localization in visual scenes by treating semantically similar audio-visual samples as positives in contrastive learning instead of as negatives, and reports favorable results against state-of-the-art methods on the VGG-SS and SoundNet-Flickr test sets.
Contribution
It introduces a method that treats semantically similar pairs ("hard positives") as positives, addressing the problem that randomly sampled negatives in contrastive learning for sound localization can contain semantically matched audio-visual pairs.
Findings
Evaluated on the VGG-SS and SoundNet-Flickr test sets
Compares favorably with state-of-the-art methods
Hard positives yield response maps similar to those of the corresponding pairs
Abstract
The objective of this work is to localize the sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard positives", are mistakenly grouped as negatives. Our key contribution is showing that hard positives can give similar response maps to the corresponding pairs. Our approach incorporates these hard positives by adding their response maps into a contrastive learning objective directly. We demonstrate the effectiveness of our approach on VGG-SS and SoundNet-Flickr test sets, showing favorable performance to the state-of-the-art methods.
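The idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the max-pooling of each response map to a single score, the temperature `tau`, and the rule for finding hard positives (thresholding audio-to-audio cosine similarity) are all assumptions made for illustration. The sketch only shows how a contrastive objective changes when semantically similar pairs are counted as positives instead of negatives.

```python
import numpy as np


def response_maps(img_feats, aud_feats):
    """Cosine similarity between every audio vector and every image location.

    img_feats: (B, C, H, W) visual feature maps
    aud_feats: (B, C) audio embeddings
    returns:   (B, B, H, W); entry [i, j] is the map of image i for audio j
    """
    v = img_feats / (np.linalg.norm(img_feats, axis=1, keepdims=True) + 1e-8)
    a = aud_feats / (np.linalg.norm(aud_feats, axis=1, keepdims=True) + 1e-8)
    return np.einsum("ichw,jc->ijhw", v, a)


def hard_positive_mask(aud_feats, threshold=0.9):
    """Hypothetical mining rule: pairs whose audio embeddings are very similar.

    The diagonal (the truly corresponding pairs) is always marked positive.
    """
    a = aud_feats / (np.linalg.norm(aud_feats, axis=1, keepdims=True) + 1e-8)
    mask = (a @ a.T) >= threshold
    np.fill_diagonal(mask, True)
    return mask


def multi_positive_nce(img_feats, aud_feats, pos_mask, tau=0.07):
    """InfoNCE-style loss in which each image may have several positive audios."""
    maps = response_maps(img_feats, aud_feats)
    # Assumption: summarise each response map by its peak value.
    scores = maps.max(axis=(2, 3)) / tau
    # Subtract the row maximum for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(scores)
    pos = (exp * pos_mask).sum(axis=1)
    total = exp.sum(axis=1)
    return float(-np.log(pos / total).mean())
```

Calling `multi_positive_nce` with `np.eye(B, dtype=bool)` as the mask corresponds to the standard setup in which only corresponding pairs are positives. Passing the output of `hard_positive_mask` moves the mined pairs from the denominator-only side of the objective to the numerator, so they are no longer pushed apart.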
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: Contrastive Learning
