Stepping Stones: A Progressive Training Strategy for Audio-Visual Semantic Segmentation
Juncheng Ma, Peiwen Sun, Yaoting Wang, Di Hu

TL;DR
This paper introduces a two-stage training strategy called Stepping Stones for audio-visual semantic segmentation, improving learning efficiency and achieving state-of-the-art results by decomposing the task into localization and semantic understanding.
Contribution
The paper proposes a novel two-stage training approach for AVSS and a new adaptive framework with an audio query generator and masked attention for enhanced feature fusion.
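As a rough illustration of the masked attention mentioned above, the sketch below implements generic single-query scaled dot-product attention in which positions outside a mask receive zero weight. The summary does not give the paper's formulation, so everything here is an assumption for illustration: the function name `masked_attention`, the `keep` argument, and the reading of `query` as an audio-derived query attending over visual positions.

```python
import math
from typing import List, Tuple


def masked_attention(
    query: List[float],
    keys: List[List[float]],
    values: List[List[float]],
    keep: List[bool],
) -> Tuple[List[float], List[float]]:
    """Generic masked scaled dot-product attention for one query.

    Hypothetical sketch, not the paper's module: positions with
    keep[i] == False get exactly zero attention weight.
    """
    if not any(keep):
        raise ValueError("mask excludes every position")

    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]

    # Subtract the largest kept score before exponentiating, for stability.
    top = max(s for s, m in zip(scores, keep) if m)
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, keep)]

    # Softmax normalization over the kept positions only.
    total = sum(weights)
    weights = [w / total for w in weights]

    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


# Three positions; the third is masked out even though its score is highest.
out, attn = masked_attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]],
    values=[[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]],
    keep=[True, True, False],
)
```

In this example the masked position contributes nothing to `out`, so the result is a blend of the first two value vectors only.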
Findings
Achieves state-of-the-art results on AVS benchmarks.
Demonstrates generalization of the training strategy to existing methods.
Improves performance by adaptive fusion of visual and audio features.
Abstract
Audio-Visual Segmentation (AVS) aims to achieve pixel-level localization of sound sources in videos, while Audio-Visual Semantic Segmentation (AVSS), as an extension of AVS, further pursues semantic understanding of audio-visual scenes. However, since the AVSS task requires the establishment of audio-visual correspondence and semantic understanding simultaneously, we observe that previous methods have struggled to handle this mashup of objectives in end-to-end training, resulting in insufficient learning and sub-optimization. Therefore, we propose a two-stage training strategy called Stepping Stones, which decomposes the AVSS task into two simple subtasks from localization to semantic understanding, which are fully optimized in each stage to achieve step-by-step global optimization. This training strategy has also proved its generalization and effectiveness on existing methods.…
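The decomposition described in the abstract can be pictured as a change in supervision: a first stage that only asks where the sounding objects are, followed by a second stage that asks which class each sounding pixel belongs to. The sketch below shows one way to derive the two sets of targets from a single semantic label map and to order them in time. It is a minimal sketch under stated assumptions, not the authors' pipeline: the helper names `to_binary_mask` and `two_stage_targets` are hypothetical, class id 0 is assumed to mean background, and the abstract does not say how first-stage outputs are passed to the second stage, so that step is left out.

```python
from typing import Iterator, List, Tuple

Mask = List[List[int]]


def to_binary_mask(semantic: Mask, background: int = 0) -> Mask:
    """Collapse class ids into sounding (1) versus background (0).

    Assumes `background` is the id used for non-sounding pixels.
    """
    return [[0 if c == background else 1 for c in row] for row in semantic]


def two_stage_targets(
    semantic_labels: List[Mask],
    epochs_stage1: int,
    epochs_stage2: int,
) -> Iterator[Tuple[str, int, List[Mask]]]:
    """Yield (stage name, epoch index, targets) in training order.

    Stage 1 supervises localization with binary masks; stage 2 then
    supervises semantics with the original class ids.
    """
    binary_labels = [to_binary_mask(m) for m in semantic_labels]
    for epoch in range(epochs_stage1):
        yield ("localization", epoch, binary_labels)
    for epoch in range(epochs_stage2):
        yield ("semantic", epoch, semantic_labels)


# One 2x2 label map with two sounding classes (ids 3 and 7).
labels = [[[0, 3], [3, 7]]]
schedule = list(two_stage_targets(labels, epochs_stage1=2, epochs_stage2=1))
stage_order = [stage for stage, _, _ in schedule]
```

A training loop would consume `schedule` in order, running all localization epochs to completion before any semantic epoch starts.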
Taxonomy
Topics: Music and Audio Processing
Methods: Softmax · Attention Is All You Need
