Unveiling Visual Biases in Audio-Visual Localization Benchmarks
Liangyu Chen, Zihao Yue, Boshen Xu, Qin Jin

TL;DR
This paper uncovers significant visual biases in Audio-Visual Source Localization (AVSL) benchmarks: models that rely only on visual cues outperform audiovisual models, so these benchmarks cannot reliably measure true audio-visual localization ability.
Contribution
The study identifies and validates visual biases in AVSL benchmarks, highlighting the need for refined datasets to better evaluate audio-visual localization models.
Findings
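Vision-only models outperform all audiovisual baselines on two representative benchmarks, VGG-SS and EpicSounding-Object.
Existing AVSL benchmarks exhibit visual bias: the sounding object can often be recognized from visual cues alone.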
Refinement of benchmarks is necessary for accurate AVSL evaluation.
Abstract
Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.
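The comparison described in the abstract can be illustrated with a minimal sketch. It scores a vision-only model and an audiovisual model against the same ground-truth boxes and reports the gap between them. The abstract does not say which metric the authors use; box IoU with a 0.5 success threshold is an assumption made here for illustration, and the names iou, success_rate, and visual_bias_gap are hypothetical helpers, not part of any benchmark toolkit.

```python
# Illustrative sketch only. The metric (box IoU, threshold 0.5) and all
# function names are assumptions, not the paper's evaluation protocol.

def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def success_rate(preds, gts, thresh=0.5):
    """Fraction of samples whose predicted box reaches IoU >= thresh."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and equal length")
    hits = sum(iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)


def visual_bias_gap(vision_only_preds, audio_visual_preds, gts, thresh=0.5):
    """Vision-only success rate minus audiovisual success rate.

    A positive gap means the model that never hears the audio scores
    higher, which is the symptom of visual bias described above.
    """
    return (success_rate(vision_only_preds, gts, thresh)
            - success_rate(audio_visual_preds, gts, thresh))


# Toy data, made up for illustration (not results from the paper).
ground_truth = [(0, 0, 2, 2), (0, 0, 4, 4)]
vision_only = [(0, 0, 2, 2), (0, 0, 4, 3)]
audio_visual = [(1, 0, 3, 2), (0, 0, 4, 4)]
gap = visual_bias_gap(vision_only, audio_visual, ground_truth)
```

On a benchmark without visual bias, one would expect this gap to be negative, since the vision-only model has no way to tell which visible object is producing the sound.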
Taxonomy
Topics: 3D Surveying and Cultural Heritage
