Music Source Separation in the Waveform Domain
Alexandre Défossez (FAIR, SIERRA, PSL), Nicolas Usunier (FAIR), Léon Bottou (FAIR), Francis Bach (DI-ENS, PSL, SIERRA)

TL;DR
This paper introduces Demucs, a waveform-to-waveform music source separation model built on a U-Net structure with a bidirectional LSTM. It outperforms existing methods in accuracy and naturalness, and it can be compressed to 120 MB without loss of accuracy.
Contribution
Demucs is a new waveform domain model for music source separation that surpasses state-of-the-art spectrogram-based methods and Conv-Tasnet, with improved naturalness and efficiency.
Findings
Demucs achieves an average SDR of 6.3 dB on MusDB, surpassing previous methods.
Proper data augmentation enhances Demucs performance.
Demucs can be compressed to 120MB without accuracy loss.
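
The SDR figure above is a signal-to-distortion ratio, expressed in decibels. As a rough illustration of what such a metric measures, the sketch below computes a simplified SDR, 10·log10(‖s‖² / ‖s − ŝ‖²), in plain Python. It is a simplification under stated assumptions: the function name `sdr_db` and the `eps` guard are hypothetical, and benchmark scores on MusDB are typically produced with a BSS Eval style toolkit whose definition is more involved, so this sketch is not expected to reproduce the reported numbers.

```python
import math


def sdr_db(reference, estimate, eps=1e-12):
    """Simplified signal-to-distortion ratio in dB (illustrative only).

    reference: sequence of samples of the true source s
    estimate:  sequence of samples of the separated source s_hat
    eps:       small constant (an assumption of this sketch) that avoids
               division by zero or log of zero for silent or perfect signals
    """
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")

    # Energy of the true source: ||s||^2
    signal_energy = sum(r * r for r in reference)
    # Energy of the residual error: ||s - s_hat||^2
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))

    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


# Toy example: the estimate is the reference scaled by 0.9,
# so the error has 1% of the signal energy.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 3))
```

Higher values mean the estimate is closer to the true source; an increase of 10 dB corresponds to a tenfold reduction in error energy relative to the source energy.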
Abstract
Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments. Contrary to many audio synthesis tasks where the best performances are achieved by models that directly generate the waveform, the state-of-the-art in source separation for music is to compute masks on the magnitude spectrum. In this paper, we compare two waveform domain architectures. We first adapt Conv-Tasnet, initially developed for speech source separation, to the task of music source separation. While Conv-Tasnet beats many existing spectrogram-domain methods, it suffers from significant artifacts, as shown by human evaluations. We propose instead Demucs, a novel waveform-to-waveform model, with a U-Net structure and bidirectional LSTM. Experiments…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
