PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
Yuchen He, Jing Zhang

TL;DR
PRIMED introduces an adaptive modality suppression approach, inspired by the biased competition theory in cognitive neuroscience, that improves referring audio-visual segmentation by dynamically emphasizing relevant modalities and suppressing irrelevant ones.
Contribution
The paper proposes a framework that explicitly models modality relevance and incorporates hierarchical global context for more accurate Ref-AVS, outperforming existing methods.
Findings
Achieves state-of-the-art performance on the Ref-AVS benchmark.
Effectively suppresses irrelevant modalities for improved segmentation.
Enhances foreground-background discrimination with contrastive learning.
Abstract
Referring Audio-Visual Segmentation (Ref-AVS) seeks to localize and segment target objects in video frames based on visual, auditory, and textual referring cues. The task is challenging because the relevance of different modalities varies across referring expressions and scenes, while existing methods typically treat multimodal cues as homogeneous inputs for fusion, prompting, or reasoning, making them vulnerable to irrelevant or misleading modalities. To address this problem, we propose PRIMED, inspired by the biased competition theory in cognitive neuroscience, which explicitly models both visual perception and language-driven prior modulation, and enables more accurate Ref-AVS by adaptive modality suppression. Specifically, a Modality Prior Decoder first estimates whether the referring expression relies primarily on audio, vision, or their joint interaction, generating a modality…
