Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?
Jia Li, Wenjie Zhao, Ziru Huang, Yunhui Guo, Yapeng Tian

TL;DR
This paper critically evaluates whether current audio-visual segmentation models truly utilize audio cues for segmenting sounding objects, revealing a visual bias and proposing a robust benchmark and method to improve reliability.
Contribution
The paper introduces AVSBench-Robust, a new benchmark that includes negative audio scenarios, and proposes a training approach that reduces visual bias and improves the robustness of AVS models.
Findings
Current AVS models generate masks based predominantly on visual salience, largely irrespective of the audio context.
Models perform poorly under negative audio conditions, indicating a visual bias.
The proposed method significantly improves robustness and maintains high segmentation quality.
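The visual bias described in these findings can be illustrated with a small check. This is a minimal sketch under stated assumptions, not the benchmark's actual protocol: the function names `iou` and `foreground_fraction`, the toy masks, and the choice of foreground fraction as a negative-audio score are all hypothetical. The idea is that a model which truly uses audio should predict an (almost) empty mask when the audio is silent or irrelevant, so any remaining foreground under negative audio points to visual bias.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = pixel marked as sounding object


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly, so IoU is defined as 1.0 here.
    return 1.0 if union == 0 else inter / union


def foreground_fraction(pred: Mask) -> float:
    """Share of pixels predicted as foreground (ideally 0.0 under negative audio)."""
    total = sum(len(row) for row in pred)
    return sum(sum(row) for row in pred) / total if total else 0.0


# Toy data (hypothetical): ground truth for a frame with matching audio.
target = [[0, 1, 1],
          [0, 1, 1]]

# A visually biased model outputs the same mask whether or not the audio matches.
pred_with_matching_audio = [[0, 1, 1],
                            [0, 1, 0]]
pred_with_silent_audio = [[0, 1, 1],
                          [0, 1, 0]]

positive_iou = iou(pred_with_matching_audio, target)
negative_fg = foreground_fraction(pred_with_silent_audio)

print(f"IoU with matching audio: {positive_iou:.2f}")
print(f"Foreground fraction with silent audio: {negative_fg:.2f}")
```

In this toy case the model scores well on the standard (positive audio) measure while still marking half of the frame as a sounding object under silence, which is the failure pattern the findings describe.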
Abstract
Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark…
Taxonomy
Topics: Music and Audio Processing
Methods: Segment Anything Model
