Generalization Challenges for Neural Architectures in Audio Source Separation
Shariq Mobin, Brian Cheung, Bruno Olshausen

TL;DR
This paper compares recurrent and convolutional neural networks for audio source separation, demonstrating that convolutional models achieve state-of-the-art results with fewer parameters and better generalization to new environments.
Contribution
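The paper proposes convolutional neural network models for speaker separation that reach state-of-the-art results with an order of magnitude fewer parameters, compares how recurrent and convolutional models generalize under three test conditions, and introduces a new dataset, RealTalkLibri, for testing separation in real-world environments.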
Findings
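Convolutional models achieve state-of-the-art separation performance with an order of magnitude fewer parameters.
Convolutional models generalize better than recurrent models, including to datasets not seen during training.
The acoustics of the environment significantly affect both the structure of the waveform and overall model performance.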
Abstract
Recent work has shown that recurrent neural networks can be trained to separate individual speakers in a sound mixture with high fidelity. Here we explore convolutional neural network models as an alternative and show that they achieve state-of-the-art results with an order of magnitude fewer parameters. We also characterize and compare the robustness and ability of these different approaches to generalize under three different test conditions: longer time sequences, the addition of intermittent noise, and different datasets not seen during training. For the last condition, we create a new dataset, RealTalkLibri, to test source separation in real-world environments. We show that the acoustics of the environment have significant impact on the structure of the waveform and the overall performance of neural network models, with the convolutional model showing superior ability to generalize…
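As a rough illustration of the second test condition named in the abstract (the addition of intermittent noise), the sketch below mixes two synthetic sources and adds short noise bursts at regular intervals. It is not the authors' procedure: the function names, the sine-wave stand-ins for speech, the Gaussian noise, the burst timing, and the sample rate are all assumptions chosen for illustration.

```python
import numpy as np

def mix_sources(sources):
    """Sum equal-length source waveforms into a single mixture."""
    return np.sum(np.stack(sources), axis=0)

def add_intermittent_noise(signal, burst_len, gap_len, noise_std, rng):
    """Add Gaussian noise bursts of burst_len samples, each preceded by gap_len clean samples.

    Hypothetical helper: the paper's actual noise type and timing are not given in the abstract.
    """
    noisy = signal.copy()
    period = burst_len + gap_len
    for start in range(gap_len, len(signal), period):
        end = min(start + burst_len, len(signal))
        noisy[start:end] += rng.normal(0.0, noise_std, end - start)
    return noisy

rng = np.random.default_rng(0)
sample_rate = 8000  # assumed sample rate, for illustration only
t = np.arange(2 * sample_rate) / sample_rate

# Two sine waves stand in for two speakers (an assumption; real inputs would be speech).
speaker_a = np.sin(2 * np.pi * 220 * t)
speaker_b = 0.5 * np.sin(2 * np.pi * 330 * t)

mixture = mix_sources([speaker_a, speaker_b])
noisy_mixture = add_intermittent_noise(
    mixture, burst_len=800, gap_len=3200, noise_std=0.1, rng=rng
)
```

A separation model trained only on clean mixtures could then be evaluated on `noisy_mixture` to probe how much its output degrades during and after each burst.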
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
