Finding Strength in Weakness: Learning to Separate Sounds with Weak Supervision
Fatemeh Pishdadian, Gordon Wichern, Jonathan Le Roux

TL;DR
This paper introduces a weakly supervised learning approach for audio source separation that does not require isolated source signals during training, enabling separation in more general and less controlled environments.
Contribution
It proposes novel objective functions and network architectures that leverage weak labels, such as clip-level or frame-level annotations, for training source separation models.
Findings
Achieves significant SI-SDR improvement with weak supervision
Performs well on urban sound mixtures with overlapping events
Enables training without isolated source data
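For readers unfamiliar with the metric named in the findings, scale-invariant signal-to-distortion ratio (SI-SDR) is commonly computed by projecting the estimate onto the reference and comparing the energy of that projection with the energy of the residual. The sketch below follows that standard definition. The function names, the `eps` stabilizer, and the assumption of zero-mean 1-D signals are illustrative choices, and this summary does not describe the paper's exact evaluation setup.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and a 1-D reference.

    Assumes both signals are zero-mean; `eps` is a hypothetical stabilizer
    added here to avoid division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Scale the reference so that it best matches the estimate (projection).
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))

def si_sdr_improvement(estimate, reference, mixture):
    """SI-SDR of the separated estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)

# Toy signals (made up for illustration, not data from the paper).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 440.0 * t)
interference = rng.standard_normal(t.shape)
mixture = reference + interference
estimate = reference + 0.1 * interference  # a cleaner, imperfect estimate
```

Because of the projection step, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant. An improvement figure is the difference between the score of the separated output and the score of the raw mixture against the same reference.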
Abstract
While there has been much recent progress using deep learning techniques to separate speech and music audio signals, these systems typically require large collections of isolated sources during the training process. When extending audio source separation algorithms to more general domains such as environmental monitoring, it may not be possible to obtain isolated signals for training. Here, we propose objective functions and network architectures that enable training a source separation system with weak labels. In this scenario, weak labels are defined in contrast with strong time-frequency (TF) labels such as those obtained from isolated sources, and refer either to frame-level weak labels where one only has access to the time periods when different sources are active in an audio mixture, or to clip-level weak labels that only indicate the presence or absence of sounds in an entire…
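The three label granularities contrasted in the abstract can be illustrated with a toy example. The sketch below starts from a hypothetical strong time-frequency activity mask per source and collapses it to frame-level and clip-level weak labels, only to show how much information each level retains. In the weakly supervised setting the abstract describes, the weak labels would come from annotation rather than from isolated sources; the array shapes and names here are assumptions.

```python
import numpy as np

def weak_labels_from_tf(source_tf_activity):
    """Collapse strong TF activity into weak labels.

    source_tf_activity: boolean array of shape (n_sources, n_freq, n_frames),
    True where a source has energy in a time-frequency bin (hypothetical input).
    Returns (frame_level, clip_level) with shapes
    (n_sources, n_frames) and (n_sources,).
    """
    tf = np.asarray(source_tf_activity, dtype=bool)
    # Frame-level: a source is active in a frame if any frequency bin is active.
    frame_level = tf.any(axis=1)
    # Clip-level: a source is present if it is active in any frame.
    clip_level = frame_level.any(axis=1)
    return frame_level, clip_level

# Toy clip: 3 sound classes, 4 frequency bins, 5 frames.
tf_activity = np.zeros((3, 4, 5), dtype=bool)
tf_activity[0, 0, 1:3] = True   # class 0 active in frames 1 and 2
tf_activity[1, 2:4, 4] = True   # class 1 active in frame 4
# class 2 never occurs in this clip

frame_level, clip_level = weak_labels_from_tf(tf_activity)
```

Each reduction discards detail: the frame-level labels no longer say which frequencies a source occupies, and the clip-level labels no longer say when it occurs.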
