Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Liangyu Chen; Zihao Yue; Boshen Xu; Qin Jin

arXiv:2409.06709·cs.MM·September 12, 2024

TL;DR

This paper uncovers significant visual biases in Audio-Visual Source Localization (AVSL) benchmarks: vision-only models outperform audiovisual baselines on them, so these benchmarks cannot reliably measure true audio-visual localization ability.

Contribution

The study identifies and validates visual biases in AVSL benchmarks, highlighting the need for refined datasets to better evaluate audio-visual localization models.

Findings

1. Vision-only models outperform audiovisual baselines on current benchmarks.
2. Existing AVSL benchmarks are biased towards visual cues.
3. Refinement of benchmarks is necessary for accurate AVSL evaluation.

Abstract

Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning.

Taxonomy

Topics: 3D Surveying and Cultural Heritage