Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Jia Li; Wenjie Zhao; Ziru Huang; Yunhui Guo; Yapeng Tian

arXiv:2502.00358 · cs.SD · February 24, 2025

TL;DR

This paper critically evaluates whether current audio-visual segmentation models truly utilize audio cues for segmenting sounding objects, revealing a visual bias and proposing a robust benchmark and method to improve reliability.

Contribution

It introduces AVSBench-Robust, a new benchmark with negative audio scenarios, and proposes a training approach that reduces visual bias and enhances robustness in AVS models.

Findings

01. Current AVS models rely heavily on visual salience and largely ignore audio cues.

02. Models perform poorly under negative audio conditions (sound that is absent or irrelevant to the scene), indicating a visual bias.

03. The proposed method significantly improves robustness while maintaining high segmentation quality.
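The visual bias described in the findings can be probed with a simple check: feed a model negative audio and measure how much of each frame it still marks as a sounding object. The sketch below is a toy illustration only. The metric (mean fraction of foreground pixels under silent audio), the function names, and both toy models are assumptions made for this example; they are not the paper's benchmark, metric, or method.

```python
from typing import Callable, List, Sequence

Frame = List[List[float]]   # grayscale frame, pixel values in [0, 1]
Mask = List[List[int]]      # binary mask, 1 = pixel predicted as sounding object
Model = Callable[[Frame, Sequence[float]], Mask]


def foreground_fraction(mask: Mask) -> float:
    """Fraction of pixels in the mask that are predicted as foreground."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total


def negative_audio_fp_fraction(model: Model,
                               frames: Sequence[Frame],
                               negative_audio: Sequence[float]) -> float:
    """Mean foreground fraction under negative audio (hypothetical metric).

    With silent or irrelevant audio nothing in the frame is a sound source,
    so every foreground pixel is a false positive and the ideal value is 0.
    """
    fractions = [foreground_fraction(model(f, negative_audio)) for f in frames]
    return sum(fractions) / len(fractions) if fractions else 0.0


def visually_biased_model(frame: Frame, audio: Sequence[float]) -> Mask:
    """Toy model that ignores audio and segments bright (salient) pixels."""
    return [[1 if px > 0.5 else 0 for px in row] for row in frame]


def audio_aware_model(frame: Frame, audio: Sequence[float]) -> Mask:
    """Toy model that returns an empty mask when the audio is silent."""
    if not any(abs(sample) > 1e-6 for sample in audio):
        return [[0 for _ in row] for row in frame]
    return visually_biased_model(frame, audio)


# One 2x2 frame with a bright "salient object" in the left column.
frames = [[[0.9, 0.1], [0.8, 0.2]]]
silence = [0.0] * 16

biased_fp = negative_audio_fp_fraction(visually_biased_model, frames, silence)
aware_fp = negative_audio_fp_fraction(audio_aware_model, frames, silence)
print(biased_fp, aware_fp)
```

In this toy setup the audio-ignoring model still segments the salient object under silence, while the audio-aware model produces an empty mask. A real evaluation would replace the toy models with an actual AVS model and use the metrics defined by the benchmark.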

Abstract

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches, leveraging transformer architectures and powerful foundation models like SAM, have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? In this paper, we systematically investigate this issue in the context of robust AVS. Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context. This bias results in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark…

Taxonomy

Topics: Music and Audio Processing

Methods: Segment Anything Model