CPM: Class-conditional Prompting Machine for Audio-visual Segmentation
Yuanhong Chen, Chong Wang, Yuyuan Liu, Hu Wang, Gustavo Carneiro

TL;DR
This paper introduces CPM, a novel class-conditional prompting approach that enhances transformer-based audio-visual segmentation by improving cross-modal interaction and bipartite matching, achieving state-of-the-art results.
Contribution
The paper proposes CPM, a framework that addresses two training issues of transformer-based AVS, the low efficacy of cross-attention and unstable bipartite matching, by combining class-agnostic and class-conditional queries and introducing new learning objectives.
Findings
Achieves state-of-the-art segmentation accuracy on AVS benchmarks.
Improves bipartite matching with a combined query learning strategy.
Enhances cross-modal attention effectiveness with new learning objectives.
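For readers unfamiliar with the term, bipartite matching in query-based segmenters is a one-to-one assignment of predicted queries to ground-truth targets that minimises a total cost. The sketch below is not taken from the paper: the function name `match_queries_to_targets` and the cost values are hypothetical, and exhaustive search stands in for the Hungarian algorithm that such systems would typically use, purely to keep the example dependency-free.

```python
from itertools import permutations


def match_queries_to_targets(cost):
    """Minimum-cost one-to-one assignment of queries to targets.

    cost[q][t] is the (hypothetical) cost of assigning query q to target t.
    Requires at least as many queries as targets. Exhaustive search is used
    for clarity, so this is only practical for very small inputs.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    if num_targets > num_queries:
        raise ValueError("need at least as many queries as targets")

    best_assignment, best_total = (), float("inf")
    # Try every ordered choice of distinct queries, one per target.
    for chosen in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(chosen))
        if total < best_total:
            best_assignment, best_total = chosen, total

    # Map each target index to the query index assigned to it.
    return {t: q for t, q in enumerate(best_assignment)}, best_total


# Illustrative costs only: 3 queries (rows) and 2 targets (columns).
example_cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.5],
]
assignment, total = match_queries_to_targets(example_cost)
print(assignment, round(total, 3))
```

With these made-up costs, target 0 is matched to query 1 and target 1 to query 0, and query 2 is left unmatched. Small changes in the cost matrix can flip which query is assigned to which target, which is one intuition for why matching can be unstable during training.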
Abstract
Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging transformer-based segmentation architecture due to its inherent ability to capture long-range dependencies and flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing
Methods: Softmax · Attention Is All You Need
