A Closer Look at Weakly-Supervised Audio-Visual Source Localization
Shentong Mo, Pedro Morgado

TL;DR
This paper critically evaluates weakly-supervised audio-visual source localization methods, introduces improved evaluation protocols with negative samples, and proposes a new approach that outperforms previous methods by addressing overfitting and negative detection.
Contribution
The paper extends benchmark datasets with negative samples, develops new evaluation metrics, and introduces a localization method that uses visual dropout and momentum encoders.
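To make the two named ingredients concrete, here is a minimal, generic sketch of an exponential-moving-average (momentum) weight update and of standard inverted dropout applied to a visual feature vector. It is not the authors' implementation: the function names, the flat-list representation of weights and features, and the default values of `m` and `p` are all assumptions chosen for illustration.

```python
import random
from typing import List


def momentum_update(target: List[float], online: List[float], m: float = 0.999) -> List[float]:
    """Move target weights toward online weights with an exponential moving average.

    Hypothetical helper: `m` close to 1 means the target encoder changes slowly.
    """
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [m * t + (1.0 - m) * o for t, o in zip(target, online)]


def feature_dropout(features: List[float], p: float, rng: random.Random) -> List[float]:
    """Standard inverted dropout: zero each feature with probability p, rescale the rest.

    Hypothetical stand-in for "visual dropout"; the paper may apply it differently.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [f / keep if rng.random() < keep else 0.0 for f in features]


if __name__ == "__main__":
    rng = random.Random(0)
    online_weights = [0.0, 1.0]
    target_weights = [1.0, 0.0]
    # One momentum step: the target drifts slightly toward the online weights.
    print(momentum_update(target_weights, online_weights, m=0.9))
    # Dropout on a toy visual feature vector.
    print(feature_dropout([1.0, 2.0, 3.0, 4.0], p=0.5, rng=rng))
```

In a sketch like this, the slowly moving target weights give more stable training targets, and dropout on the visual features acts as a regularizer against the overfitting the summary mentions.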
Findings
Most prior methods cannot detect negatives effectively.
Existing methods overfit and rely on early stopping, which under popular evaluation protocols requires a fully annotated dataset.
The proposed approach achieves state-of-the-art results on benchmark datasets.
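The point about negatives can be illustrated with a small, hypothetical evaluation sketch: a clip counts as "has a source" when the peak of its localization map exceeds a threshold, and detection quality is scored with F1 over both positive and negative clips. The thresholding rule, the function names, and the use of F1 are assumptions for illustration only and are not the metrics defined in the paper.

```python
from typing import List, Sequence


def detect_source(localization_map: Sequence[Sequence[float]], threshold: float) -> bool:
    """Predict that a sound source is present if the map's peak exceeds the threshold.

    Assumed decision rule for illustration.
    """
    peak = max(max(row) for row in localization_map)
    return peak > threshold


def detection_f1(predicted: List[bool], has_source: List[bool]) -> float:
    """F1 score for source-presence detection over positive and negative clips."""
    if len(predicted) != len(has_source):
        raise ValueError("predictions and labels must have the same length")
    tp = sum(1 for p, y in zip(predicted, has_source) if p and y)
    fp = sum(1 for p, y in zip(predicted, has_source) if p and not y)
    fn = sum(1 for p, y in zip(predicted, has_source) if not p and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy 1x2 localization maps: the first has a clear peak, the second does not.
    maps = [[[0.1, 0.9]], [[0.1, 0.2]]]
    labels = [True, False]  # the second clip is a negative (no source)
    preds = [detect_source(m, threshold=0.5) for m in maps]
    print(preds, detection_f1(preds, labels))
    # A model that always claims a source is present is penalized on the negative clip.
    print(detection_f1([True, True], labels))
```

A metric that only scores positive clips would not distinguish the two predictors above; including negatives is what exposes the always-positive behavior.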
Abstract
Audio-visual source localization is a challenging task that aims to predict the location of visual sound sources in a video. Since collecting ground-truth annotations of sounding objects can be costly, a plethora of weakly-supervised localization methods that can learn from datasets with no bounding-box annotations have been proposed in recent years, by leveraging the natural co-occurrence of audio and visual signals. Despite significant interest, popular evaluation protocols have two major flaws. First, they allow for the use of a fully annotated dataset to perform early stopping, thus significantly increasing the annotation effort required for training. Second, current evaluation metrics assume the presence of sound sources at all times. This is of course an unrealistic assumption, and thus better metrics are necessary to capture the model's performance on (negative) samples with no…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Video Analysis and Summarization
Methods: Test · Early Stopping · Dropout
