TL;DR
This paper introduces a novel audio-visual learning framework that effectively eliminates false negatives in sound source localization, improving performance in localization, event classification, and object detection tasks.
Contribution
It proposes two complementary schemes, SSPL and SACL, to address false negatives in contrastive learning, enhancing audio-visual feature alignment and robustness.
Findings
Outperforms state-of-the-art methods in sound source localization
Improves accuracy in audio-visual event classification
Enhances object detection performance
Abstract
Sound source localization aims to localize objects emitting the sound in visual scenes. Recent works obtaining impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior work can lead to the false negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. As a result, this misalignment of audio and visual features could yield inferior performance. To address this issue, we propose a novel audio-visual learning framework which is instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, while a predictive coding…
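The false negative issue described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's SSPL or SACL; it is a generic InfoNCE loss written in NumPy, with an optional mask that drops selected negatives from the denominator. The function name `info_nce`, the toy embeddings, and the `labels` array are all assumptions made for illustration. In a self-supervised setting no labels exist, so the labels here only mark which negatives happen to be semantically similar to the anchor.

```python
import numpy as np


def info_nce(visual, audio, temperature=0.1, negative_mask=None):
    """Mean InfoNCE loss for paired, L2-normalised embeddings.

    visual, audio: arrays of shape (batch, dim); row i of each is a positive pair.
    negative_mask: optional boolean (batch, batch) array, True where audio j is
    kept as a negative for visual anchor i. Positives (diagonal) are always kept.
    """
    logits = visual @ audio.T / temperature
    batch = logits.shape[0]
    if negative_mask is None:
        keep = np.ones((batch, batch), dtype=bool)  # random sampling: keep everything
    else:
        keep = negative_mask.copy()
    np.fill_diagonal(keep, True)

    # Dropped negatives get -inf so they contribute exp(-inf) = 0 to the denominator.
    masked = np.where(keep, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)  # for a stable log-sum-exp
    log_denominator = row_max[:, 0] + np.log(np.exp(masked - row_max).sum(axis=1))
    return float(np.mean(log_denominator - np.diag(logits)))


def unit(vector):
    """Scale a vector to unit length."""
    vector = np.asarray(vector, dtype=float)
    return vector / np.linalg.norm(vector)


# Toy batch: samples 0 and 1 are semantically similar, sample 2 is different.
visual = np.stack([unit([1.0, 0.1]), unit([1.0, 0.2]), unit([0.0, 1.0])])
audio = np.stack([unit([1.0, 0.2]), unit([1.0, 0.1]), unit([0.1, 1.0])])
labels = np.array([0, 0, 1])  # hypothetical labels, used only to flag false negatives

# Keep audio j as a negative for anchor i only when the two differ semantically.
keep_true_negatives = labels[:, None] != labels[None, :]

loss_random = info_nce(visual, audio)
loss_masked = info_nce(visual, audio, negative_mask=keep_true_negatives)
print(f"all negatives: {loss_random:.4f}  false negatives removed: {loss_masked:.4f}")
```

With all negatives kept, the anchor for sample 0 is penalised for being close to the audio of sample 1, even though the two are semantically similar. Masking those entries removes that penalty, so the masked loss is lower on this toy batch.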
Taxonomy
Methods: Contrastive Learning
