Can Textual Semantics Mitigate Sounding Object Segmentation Preference?

Yaoting Wang; Peiwen Sun; Yuanchao Li; Honggang Zhang; Di Hu

arXiv:2407.10947 · cs.CV · July 16, 2024 · 1 citation

TL;DR

This paper proposes leveraging text cues derived from scene descriptions to improve audio guidance in audio-visual segmentation, addressing the weak semantics of audio in multi-source scenes and enhancing segmentation accuracy.

Contribution

The paper introduces a novel semantics-driven audio modeling module that integrates text cues with audio features to improve segmentation performance.

Findings

01. Enhanced sensitivity to audio cues with text assistance

02. Achieved competitive results on AVS benchmarks

03. Demonstrated the effectiveness of text-aided audio modeling

Abstract

The Audio-Visual Segmentation (AVS) task aims to segment sounding objects in the visual space using audio cues. However, we find that previous AVS methods rely heavily on detrimental segmentation preferences for audible objects rather than on precise audio guidance. We argue that the primary reason is that audio lacks robust semantics compared to vision, especially in multi-source sounding scenes, which results in weak audio guidance over the visual space. Motivated by the fact that the text modality is well explored and contains rich abstract semantics, we propose leveraging text cues from the visual scene to enhance audio guidance with the semantics inherent in text. Our approach begins by obtaining scene descriptions through an off-the-shelf image captioner and prompting a frozen large language model to deduce potential sounding objects as text cues.…
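
The first stage described in the abstract (caption the frame, then prompt a frozen language model for potential sounding objects) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the comma-separated reply format, and the names `build_prompt`, `parse_text_cues`, and `text_cues_for_frame` are all assumptions made for illustration, and the captioner and language model are passed in as plain callables so that any off-the-shelf models could be substituted.

```python
from typing import Callable, List

# Hypothetical prompt wording; the paper's actual prompt is not given here.
PROMPT_TEMPLATE = (
    "Scene description: {caption}\n"
    "List the objects in this scene that could plausibly make sound, "
    "as a comma-separated list."
)


def build_prompt(caption: str) -> str:
    """Wrap a scene caption in the (assumed) instruction for the frozen LLM."""
    return PROMPT_TEMPLATE.format(caption=caption.strip())


def parse_text_cues(llm_reply: str) -> List[str]:
    """Split a comma-separated reply into lower-cased cues without duplicates."""
    cues: List[str] = []
    for part in llm_reply.split(","):
        cue = part.strip().strip(".").lower()
        if cue and cue not in cues:
            cues.append(cue)
    return cues


def text_cues_for_frame(
    frame: object,
    captioner: Callable[[object], str],  # assumed interface: frame -> caption
    llm: Callable[[str], str],           # assumed interface: prompt -> reply
) -> List[str]:
    """Caption the frame, query the LLM, and return candidate sounding objects."""
    caption = captioner(frame)
    reply = llm(build_prompt(caption))
    return parse_text_cues(reply)
```

With stand-in callables, for example `text_cues_for_frame(None, lambda f: "A man plays a guitar next to a dog.", lambda p: "man, guitar, dog")`, the function returns the parsed list of candidate sounding objects. How such cues are then combined with audio features is the job of the paper's audio modeling module, which this sketch does not cover.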

Code & Models

Repositories

gewu-lab/sounding-object-segmentation-preference
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Advanced Text Analysis Techniques