Implicit Counterfactual Learning for Audio-Visual Segmentation
Mingfeng Zha, Tianyu Li, Guoqing Wang, Peng Wang, Yangyang Wu, Yang Yang, Heng Tao Shen

TL;DR
This paper introduces an implicit counterfactual learning framework for audio-visual segmentation that reduces modality gaps, mitigates biases, and improves cross-modal understanding, leading to state-of-the-art results.
Contribution
The paper proposes the implicit counterfactual framework with multi-granularity implicit text, semantic counterfactuals, and distribution-aware contrastive learning for unbiased AVS.
Findings
Achieves state-of-the-art performance on three datasets.
Effectively reduces modality gaps and biases.
Improves cross-modal understanding in complex scenes.
Abstract
Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
