Audio-Visual Segmentation by Exploring Cross-Modal Mutual Semantics
Chen Liu, Peike Li, Xingqun Qi, Hu Zhang, Lincheng Li, Dadong Wang, Xin Yu

TL;DR
This paper introduces an audio-visual instance-aware segmentation method that overcomes dataset bias by first localizing potential sounding objects and then associating them with the audio, using silent object-aware segmentation and audio-visual semantic correlation.
Contribution
The proposed approach uniquely combines silent object-aware segmentation with audio-visual semantic correlation to improve sounding object localization beyond dataset saliency bias.
Findings
Effective segmentation of sounding objects demonstrated on AVS benchmarks.
Reduces bias towards salient objects in audio-visual segmentation.
Improves association of audio with potential visual objects.
Abstract
The audio-visual segmentation (AVS) task aims to segment sounding objects from a given video. Existing works mainly focus on fusing audio and visual features of a given video to obtain sounding object masks. However, we observed that prior methods are prone to segmenting a certain salient object in a video regardless of the audio information. This is because sounding objects are often the most salient ones in the AVS dataset. Thus, current AVS methods might fail to localize genuine sounding objects due to the dataset bias. In this work, we present an audio-visual instance-aware segmentation approach to overcome the dataset bias. In a nutshell, our method first localizes potential sounding objects in a video by an object segmentation network, and then associates the sounding object candidates with the given audio. We notice that an object could be a sounding object in one video but a silent…
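The two-stage idea in the abstract (propose candidate objects first, then keep only those that match the audio) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the fixed threshold, and the toy embeddings are all assumptions made for illustration. In the paper, candidates come from an object segmentation network and the association is learned.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def union_masks(masks, height, width):
    """Pixel-wise OR of binary masks given as nested lists of 0/1."""
    out = [[0] * width for _ in range(height)]
    for mask in masks:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    out[r][c] = 1
    return out


def segment_sounding_objects(candidates, audio_embedding, height, width, threshold=0.5):
    """Hypothetical stage 2: keep candidates whose embedding matches the audio.

    candidates: list of (mask, embedding) pairs, assumed to come from a
    stage-1 object segmentation step that ignores the audio entirely.
    """
    kept = [
        mask
        for mask, embedding in candidates
        if cosine(embedding, audio_embedding) >= threshold
    ]
    return union_masks(kept, height, width)


# Toy example: two candidate objects in a 2x4 frame (assumed data).
guitar_mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
person_mask = [[0, 0, 1, 1], [0, 0, 1, 1]]
candidates = [(guitar_mask, [1.0, 0.0]), (person_mask, [0.0, 1.0])]

# Audio that resembles the guitar embedding selects only the guitar mask.
guitar_audio = [0.9, 0.1]
result = segment_sounding_objects(candidates, guitar_audio, height=2, width=4)
```

Because candidates are proposed without looking at the audio, a salient but silent object (the person above) is still a candidate, yet it is dropped at the association step. This is the property the abstract relies on to counter saliency bias.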
Taxonomy
Topics: Music and Audio Processing · Digital Media Forensic Detection · Speech and Audio Processing
Methods: fail · Focus
