On Temporal Guidance and Iterative Refinement in Audio Source Separation
Tobias Morocutti, Jonathan Greif, Paul Primus, Florian Schmid, Gerhard Widmer

TL;DR
This paper introduces a novel audio source separation method that combines fine-grained temporal sound event detection with iterative refinement, significantly improving separation quality in complex sound scenes.
Contribution
The paper integrates Transformer-based sound event detection with iterative refinement to improve audio source separation.
Findings
Achieved second place in DCASE Challenge 2025 Task 4
Significant improvements in audio tagging accuracy
Enhanced source separation quality through iterative refinement
Abstract
Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline (audio tagging followed by label-conditioned source separation) but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an…
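To make the two ideas named in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that frame-level SED output is available as a per-frame activity vector for one sound class, and that refinement means feeding the previous source estimate back into the separator together with the mixture. The function names (`upsample_activity`, `toy_separator`, `separate_with_refinement`) and the gating rule are hypothetical placeholders for the learned models.

```python
import numpy as np


def upsample_activity(frame_activity, num_samples):
    """Stretch per-frame activity (values in [0, 1]) to per-sample resolution."""
    frame_activity = np.asarray(frame_activity, dtype=float)
    num_frames = len(frame_activity)
    # Map every sample index to the frame that covers it.
    idx = np.minimum(np.arange(num_samples) * num_frames // num_samples,
                     num_frames - 1)
    return frame_activity[idx]


def toy_separator(mixture, previous_estimate, activity):
    """Hypothetical stand-in for a learned separator.

    It averages the mixture with the previous estimate and gates the result
    with the time-varying activity, so inactive regions are silenced.
    """
    return activity * 0.5 * (mixture + previous_estimate)


def separate_with_refinement(mixture, frame_activity, num_iterations=3,
                             separator=toy_separator):
    """Run the separator repeatedly, feeding each estimate back in."""
    mixture = np.asarray(mixture, dtype=float)
    activity = upsample_activity(frame_activity, len(mixture))
    estimate = mixture.copy()  # the first pass starts from the mixture itself
    for _ in range(num_iterations):
        estimate = separator(mixture, estimate, activity)
    return estimate


if __name__ == "__main__":
    mixture = np.ones(8)
    # Two SED frames: the class is inactive in the first half, active in the second.
    print(separate_with_refinement(mixture, frame_activity=[0.0, 1.0]))
```

In a real system the separator would be a neural network conditioned on the class label. The sketch only illustrates how frame-level guidance and a feedback loop could be wired together.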
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
