Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation
Daniel Stoller, Sebastian Ewert, Simon Dixon

TL;DR
This paper introduces Wave-U-Net, a time-domain neural network for audio source separation that models phase information and performs comparably to spectrogram-based methods. It also proposes reporting rank-based statistics to reduce the influence of outliers in SDR evaluation.
Contribution
The paper presents Wave-U-Net, a novel end-to-end time-domain architecture with multi-scale processing, architectural improvements, and a new evaluation reporting method for audio source separation.
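The multi-scale processing can be pictured as repeatedly halving the time resolution of a sequence and later restoring it. The following is a minimal sketch, not the authors' implementation: it assumes decimation by keeping every second sample and upsampling by linear interpolation, and the function names (`decimate`, `upsample_linear`, `multi_scale_views`) and the toy input are hypothetical. A real network would apply learned convolutions at each scale and combine the results.

```python
def decimate(signal):
    """Halve the time resolution by keeping every second sample."""
    return signal[::2]


def upsample_linear(signal):
    """Raise the time resolution by inserting the midpoint between neighbours.

    Assumes a non-empty input; the output has 2 * len(signal) - 1 samples.
    """
    out = []
    for left, right in zip(signal, signal[1:]):
        out.append(left)
        out.append((left + right) / 2)
    out.append(signal[-1])
    return out


def multi_scale_views(signal, levels):
    """Return the signal at successively coarser time scales, finest first."""
    views = [signal]
    for _ in range(levels):
        views.append(decimate(views[-1]))
    return views


# Hypothetical toy input: 9 samples of a ramp.
x = [float(i) for i in range(9)]
views = multi_scale_views(x, levels=2)

# Two upsampling steps bring the coarsest view back to the input length.
restored = upsample_linear(upsample_linear(views[-1]))
```

Because the toy input is a linear ramp, interpolation recovers it exactly. For real audio the coarse view loses detail, which is why features from the finer scales need to be carried across and combined with the upsampled ones.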
Findings
Wave-U-Net performs comparably to state-of-the-art spectrogram-based models.
Architectural enhancements improve separation quality and reduce artifacts.
Reporting rank-based statistics mitigates outlier issues in SDR evaluation.
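To illustrate why rank-based statistics help, the sketch below compares mean and standard deviation with median and median absolute deviation (MAD) on a handful of per-segment SDR values. The choice of median and MAD as the rank-based statistics is an assumption here, the helper name `summarize_sdr` is hypothetical, and the SDR values are invented for illustration only.

```python
import statistics


def summarize_sdr(sdr_values):
    """Summarise per-segment SDR values (dB) with mean/std and median/MAD."""
    med = statistics.median(sdr_values)
    # MAD: median of the absolute deviations from the median.
    mad = statistics.median([abs(v - med) for v in sdr_values])
    return {
        "mean": statistics.fmean(sdr_values),
        "std": statistics.pstdev(sdr_values),
        "median": med,
        "mad": mad,
    }


# Invented values: five typical segments and one extreme outlier,
# such as might arise when a source is almost silent in a segment.
segment_sdrs = [5.0, 6.0, 5.5, 6.5, 5.8, -40.0]
summary = summarize_sdr(segment_sdrs)

for name, value in summary.items():
    print(f"{name}: {value:.2f} dB")
```

In this toy case the single outlier pulls the mean below zero, while the median stays close to the typical segment values.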
Abstract
Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling…
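One way an output layer can enforce source additivity is to predict all sources but one and derive the last as the mixture minus the sum of the others, so the estimates always add up to the input. The sketch below assumes that design. The function name `difference_output` and the sample values are hypothetical, and plain lists stand in for network outputs.

```python
def difference_output(mixture, estimates):
    """Append a final source computed as mixture minus the other estimates.

    mixture:   list of samples of the input mixture
    estimates: list of K-1 source estimates, each the same length as mixture
    Returns K source estimates whose sample-wise sum equals the mixture.
    """
    residual = list(mixture)
    for est in estimates:
        if len(est) != len(mixture):
            raise ValueError("every estimate must match the mixture length")
        residual = [r - e for r, e in zip(residual, est)]
    return estimates + [residual]


# Invented toy signals: one predicted source, the other derived.
mixture = [0.5, -0.25, 0.75, 0.0]
vocals_estimate = [0.25, -0.5, 0.5, 0.25]

sources = difference_output(mixture, [vocals_estimate])

# Summing the sources sample by sample gives back the mixture.
reconstructed = [sum(samples) for samples in zip(*sources)]
```

The constraint holds by construction, whatever the predicted sources are, so no additional loss term is needed to encourage it.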
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Concatenated Skip Connection · Max Pooling · Convolution · U-Net
