Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale, Haosen Yang, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu

TL;DR
This paper introduces an unsupervised audio-visual segmentation method called MoCA that leverages foundation models and contrastive learning to identify sound-producing objects at the pixel level without costly annotations.
Contribution
The paper presents a novel unsupervised learning framework, MoCA, integrating foundation models for multi-modality association in audio-visual segmentation, reducing reliance on annotated data.
Findings
MoCA outperforms baseline methods on AVSBench and AVSS datasets.
Achieves significant mIoU improvements, e.g., +17.24% on AVSBench S4.
Approaches supervised performance in complex multi-object scenarios.
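The findings above are reported in mIoU (mean intersection over union). As a rough illustration of what that metric measures, here is a minimal sketch in plain Python. The function names `iou` and `mean_iou` and the toy masks are hypothetical, and the benchmark's exact evaluation protocol (for example, how frames are averaged) is not described in this summary, so this should not be read as the authors' evaluation code.

```python
from typing import Sequence

# Binary mask as nested sequences: 1 = sounding object, 0 = background.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average IoU over paired predicted and ground-truth masks."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy example: 2 overlapping pixels out of 4 covered by either mask.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = iou(pred, target)  # 2 / 4 = 0.5
```

A higher mIoU means the predicted masks overlap more closely with the reference masks of the sound-producing objects.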
Abstract
Audio-Visual Segmentation (AVS) aims to identify, at the pixel level, the object in a visual scene that produces a given sound. Current AVS methods rely on costly fine-grained annotations of mask-audio pairs, making them impractical for scalability. To address this, we introduce unsupervised AVS, eliminating the need for such expensive annotation. To tackle this more challenging problem, we propose an unsupervised learning method, named Modality Correspondence Alignment (MoCA), which seamlessly integrates off-the-shelf foundation models like DINO, SAM, and ImageBind. This approach leverages their knowledge complementarity and optimizes their joint usage for multi-modality association. Initially, we estimate positive and negative image pairs in the feature space. For pixel-level association, we introduce an audio-visual adapter and a novel pixel matching aggregation strategy within the…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Digital Media Forensic Detection
Methods: Attention Is All You Need · Softmax · Residual Connection · Linear Layer · Dense Connections · Layer Normalization · Multi-Head Attention · Vision Transformer · self-DIstillation with NO labels · Segment Anything Model
