Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation

Chengzhi Li; Heyan Huang; Ping Jian; Yanghao Zhou

arXiv:2603.21948 · cs.MM · March 24, 2026

TL;DR

This paper proposes a weakly supervised method for audio-visual semantic segmentation that uses only video-level labels, employing a novel progressive cross-modal alignment approach to generate per-frame masks of sounding objects.

Contribution

It introduces the WSAVSS task and the PCAS framework, which segments sounding objects using only video-level labels by progressively aligning audio and visual representations.

Findings

01 Achieves state-of-the-art results among weakly supervised methods on AVS.

02 Performs competitively with fully supervised methods on AVSS.

03 Validates the effectiveness of progressive cross-modal alignment.

Abstract

Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and…
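The idea of mapping audio categories to image regions without mask annotations can be illustrated with a small sketch. This is not the PCAS implementation: the function names (`region_class_similarity`, `pseudo_mask`, `info_nce`), the cosine-similarity scoring, the `threshold` value, and the use of a standard InfoNCE loss are all assumptions chosen for illustration. The sketch shows (a) how a video-level label set could restrict which categories a region may be assigned, and (b) a generic contrastive loss over matched audio and visual embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def region_class_similarity(region_feats, class_embeds):
    """Cosine similarity between image regions and category embeddings.

    region_feats: (R, D) array, one feature vector per image region.
    class_embeds: (C, D) array, one embedding per audio category.
    Returns an (R, C) similarity matrix.
    """
    return l2_normalize(region_feats) @ l2_normalize(class_embeds).T


def pseudo_mask(region_feats, class_embeds, video_labels, threshold=0.5):
    """Assign each region a category drawn only from the video-level labels.

    Regions whose best similarity falls below `threshold` (a hypothetical
    hyperparameter) are marked -1, meaning background / not sounding.
    """
    sim = region_class_similarity(region_feats, class_embeds)
    allowed = np.array(sorted(video_labels))   # categories present in the video
    sim_allowed = sim[:, allowed]              # ignore all other categories
    best = sim_allowed.argmax(axis=1)
    best_score = sim_allowed.max(axis=1)
    return np.where(best_score >= threshold, allowed[best], -1)


def info_nce(audio_embeds, visual_embeds, temperature=0.07):
    """Generic audio-to-visual InfoNCE loss.

    Row i of `audio_embeds` is assumed to be the positive match for row i of
    `visual_embeds`; every other row acts as a negative.
    """
    logits = l2_normalize(audio_embeds) @ l2_normalize(visual_embeds).T
    logits = logits / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, calling `pseudo_mask(region_feats, class_embeds, {0, 2})` labels each region with category 0, category 2, or -1, so a category absent from the video-level labels can never appear in the pseudo mask. How PCAS actually stages its alignment and which losses it uses are not specified in the abstract.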

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Subtitles and Audiovisual Media