Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
Chengzhi Li, Heyan Huang, Ping Jian, Yanghao Zhou

TL;DR
This paper proposes a weakly supervised method for audio-visual semantic segmentation that uses only video-level labels, employing a novel progressive cross-modal alignment approach to generate per-frame masks of sounding objects.
Contribution
It introduces the WSAVSS task and the PCAS framework, which segments sounding objects using only video-level labels by progressively aligning audio and visual features.
Findings
Achieves state-of-the-art results among weakly supervised methods on AVS.
Performs competitively with fully supervised methods on AVSS.
Validates the effectiveness of progressive cross-modal alignment.
Abstract
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and…
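The contrastive alignment step mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' PCAS implementation. It is a plain InfoNCE-style objective in NumPy, assuming one audio embedding and one visual feature map per frame. The function names (`similarity_maps`, `info_nce`, `pseudo_mask`), the max-pooling over pixels, the temperature, and the threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length along `axis`.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_maps(audio, visual):
    # audio: (B, D) audio embeddings; visual: (B, H, W, D) per-pixel features.
    # Returns (B, B, H, W): cosine similarity of audio i with every pixel of frame j.
    a = l2_normalize(audio)
    v = l2_normalize(visual)
    return np.einsum("id,jhwd->ijhw", a, v)

def info_nce(audio, visual, temperature=0.07):
    # Assumption: a frame's score for an audio clip is its best-matching pixel.
    sims = similarity_maps(audio, visual)
    b = sims.shape[0]
    logits = sims.reshape(b, b, -1).max(axis=-1) / temperature  # (B, B)
    # Numerically stable log-softmax over candidate frames for each audio clip.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matching audio/frame pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_prob)))

def pseudo_mask(audio_vec, frame_feats, threshold=0.5):
    # audio_vec: (D,); frame_feats: (H, W, D). Thresholds the similarity map
    # to obtain a binary mask without any mask annotation.
    sim = similarity_maps(audio_vec[None], frame_feats[None])[0, 0]
    return sim > threshold
```

In this sketch, minimizing `info_nce` pulls each audio embedding toward some region of its own frame and away from the other frames in the batch, and `pseudo_mask` then reads off the region that responds to the audio. A real system would use learned encoders and a differentiable framework rather than fixed NumPy arrays.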
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Subtitles and Audiovisual Media
