Implicit Counterfactual Learning for Audio-Visual Segmentation

Mingfeng Zha; Tianyu Li; Guoqing Wang; Peng Wang; Yangyang Wu; Yang Yang; Heng Tao Shen

arXiv:2507.20740 · cs.CV · July 29, 2025

TL;DR

This paper introduces an implicit counterfactual learning framework for audio-visual segmentation that reduces modality gaps, mitigates biases, and improves cross-modal understanding, leading to state-of-the-art results.

Contribution

The paper proposes the implicit counterfactual framework (ICF), which combines multi-granularity implicit text (MIT), semantic counterfactuals, and distribution-aware contrastive learning to achieve unbiased AVS.
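As a rough illustration of the contrastive component mentioned above, the sketch below implements a generic symmetric InfoNCE loss between paired audio and visual embeddings. This is a standard baseline and not the paper's distribution-aware variant, whose details are not given here; the function names, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Scale a vector to unit length; zero vectors are returned unchanged.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(audio, visual, temperature=0.1):
    """Generic symmetric InfoNCE (assumed baseline, not the paper's loss).

    audio[i] and visual[i] form a positive pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(x) for x in audio]
    v = [l2_normalize(x) for x in visual]
    n = len(a)
    # Cosine-similarity logits, audio rows against visual columns.
    logits = [[dot(a[i], v[j]) / temperature for j in range(n)] for i in range(n)]
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    audio_to_visual = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    visual_to_audio = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (audio_to_visual + visual_to_audio)


# Toy embeddings (hypothetical): matched pairs vs. deliberately swapped pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = info_nce(aligned, aligned)
loss_swapped = info_nce(aligned, swapped)
```

With these toy inputs, the matched pairing should give a much lower loss than the swapped pairing, which is the behaviour a contrastive objective relies on to pull corresponding audio and visual features together.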

Findings

01. Achieves state-of-the-art performance on three datasets.

02. Effectively reduces modality gaps and biases.

03. Improves cross-modal understanding in complex scenes.

Abstract

Audio-visual segmentation (AVS) aims to segment objects in videos based on audio cues. Existing AVS methods are primarily designed to enhance interaction efficiency but pay limited attention to modality representation discrepancies and imbalances. To overcome this, we propose the implicit counterfactual framework (ICF) to achieve unbiased cross-modal understanding. Due to the lack of semantics, heterogeneous representations may lead to erroneous matches, especially in complex scenes with ambiguous visual content or interference from multiple audio sources. We introduce the multi-granularity implicit text (MIT) involving video-, segment- and frame-level as the bridge to establish the modality-shared space, reducing modality gaps and providing prior guidance. Visual content carries more information and typically dominates, thereby marginalizing audio features in the decision-making. To…
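To make the three granularities named in the abstract concrete, the sketch below derives frame-, segment-, and video-level vectors from a list of per-frame features by mean pooling. This only illustrates what the levels refer to; it is not the paper's MIT construction, which is not described in this excerpt. The function names, the pooling choice, and the segment length are assumptions.

```python
def mean_pool(vectors):
    # Element-wise mean of a non-empty list of equal-length vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def multi_granularity(frame_feats, segment_len):
    """Build three temporal granularities from per-frame features.

    Hypothetical helper: segments are consecutive, non-overlapping
    windows of `segment_len` frames (the last one may be shorter).
    """
    segments = [
        mean_pool(frame_feats[i:i + segment_len])
        for i in range(0, len(frame_feats), segment_len)
    ]
    return {
        "frame": frame_feats,             # one vector per frame
        "segment": segments,              # one vector per window
        "video": mean_pool(frame_feats),  # one vector for the clip
    }


# Toy per-frame features (hypothetical values): 4 frames, 2 dimensions.
frames = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0]]
levels = multi_granularity(frames, segment_len=2)
```

Here the four toy frames yield two segment vectors and one video vector; in a real model, each level would typically be a learned embedding rather than a plain average.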

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis