On Temporal Guidance and Iterative Refinement in Audio Source Separation

Tobias Morocutti; Jonathan Greif; Paul Primus; Florian Schmid; Gerhard Widmer

arXiv:2507.17297·cs.SD·July 24, 2025

TL;DR

This paper introduces an audio source separation method that combines fine-grained temporal sound event detection with iterative refinement to improve separation quality in complex sound scenes.

Contribution

The paper proposes an approach that integrates Transformer-based sound event detection with iterative refinement to improve audio source separation.

Findings

01. Achieved second place in DCASE Challenge 2025 Task 4
02. Significant improvements in audio tagging accuracy
03. Enhanced source separation quality through iterative refinement

Abstract

Spatial semantic segmentation of sound scenes (S5) involves the accurate identification of active sound classes and the precise separation of their sources from complex acoustic mixtures. Conventional systems rely on a two-stage pipeline - audio tagging followed by label-conditioned source separation - but are often constrained by the absence of fine-grained temporal information critical for effective separation. In this work, we address this limitation by introducing a novel approach for S5 that enhances the synergy between the event detection and source separation stages. Our key contributions are threefold. First, we fine-tune a pre-trained Transformer to detect active sound classes. Second, we utilize a separate instance of this fine-tuned Transformer to perform sound event detection (SED), providing the separation module with detailed, time-varying guidance. Third, we implement an…
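The data flow described in the abstract (clip-level tagging, frame-level SED guidance, then repeated separation passes) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: every function name here (`frames_to_sample_mask`, `separate_scene`, and the `toy_*` stand-ins) is hypothetical, the models are replaced by trivial placeholders, and because the abstract is truncated the refinement scheme (feeding the previous estimate back into the separator) is an assumption.

```python
from typing import Callable, Dict, List, Optional

Signal = List[float]


def frames_to_sample_mask(frame_probs: List[float], hop: int,
                          n_samples: int, threshold: float = 0.5) -> Signal:
    """Expand frame-level SED probabilities (non-empty) into a binary per-sample mask."""
    mask = []
    for i in range(n_samples):
        frame = min(i // hop, len(frame_probs) - 1)  # clamp to the last frame
        mask.append(1.0 if frame_probs[frame] >= threshold else 0.0)
    return mask


def separate_scene(
    mixture: Signal,
    tagger: Callable[[Signal], List[str]],
    detector: Callable[[Signal, str], List[float]],
    separator: Callable[[Signal, Signal, Optional[Signal]], Signal],
    hop: int,
    n_iterations: int = 2,
) -> Dict[str, Signal]:
    """Tag the clip, derive temporal guidance per class, then refine iteratively."""
    estimates: Dict[str, Signal] = {}
    for label in tagger(mixture):  # stage 1: which classes are active in the clip
        # Time-varying guidance: when this class is active.
        mask = frames_to_sample_mask(detector(mixture, label), hop, len(mixture))
        estimate: Optional[Signal] = None
        for _ in range(n_iterations):
            # ASSUMPTION: refinement re-runs the separator on its previous output.
            estimate = separator(mixture, mask, estimate)
        estimates[label] = estimate
    return estimates


# Hypothetical stand-ins for the real models, used only to exercise the flow.
def toy_tagger(mixture: Signal) -> List[str]:
    return ["speech"]


def toy_detector(mixture: Signal, label: str) -> List[float]:
    return [0.9, 0.1]  # active in the first frame, inactive in the second


def toy_separator(mixture: Signal, mask: Signal,
                  previous: Optional[Signal]) -> Signal:
    gated = [m * x for m, x in zip(mask, mixture)]
    if previous is None:
        return gated
    return [(g + p) / 2 for g, p in zip(gated, previous)]


result = separate_scene([1.0] * 8, toy_tagger, toy_detector, toy_separator, hop=4)
```

With these toy stand-ins, the estimate for `"speech"` keeps the first four samples and zeroes the rest, which shows how frame-level activity restricts where the separator is allowed to place energy.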

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing