Multi-scale temporal-frequency attention for music source separation
Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu

TL;DR
This paper introduces a multi-scale temporal-frequency attention module for music source separation. The module explicitly models spectrogram correlations along the time and frequency dimensions and achieves state-of-the-art vocal separation on MUSDB18.
Contribution
It proposes a novel attention mechanism that captures multi-scale temporal and frequency correlations in spectrograms for music source separation.
Findings
Outperforms existing methods with 9.51 dB SDR on vocal separation
Effectively models spectrogram correlations across multiple scales
Achieves state-of-the-art performance on MUSDB18 dataset
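For readers unfamiliar with the metric, signal-to-distortion ratio (SDR) in its simplest form is the energy of the reference signal divided by the energy of the estimation error, expressed in decibels. The sketch below uses that simplified definition. The function name `sdr_db` and the `eps` guard are illustrative assumptions, and benchmark scores on MUSDB18 are typically computed with BSS Eval style tooling that differs in detail, so this is not expected to reproduce the 9.51 dB figure.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2).

    `eps` is an assumed small constant that avoids division by zero;
    it is not part of any benchmark definition.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))

# Toy signals: the estimate is the reference scaled by 0.9,
# so the error carries 1% of the reference energy.
reference = np.array([1.0, 0.0, -1.0, 0.0])
estimate = 0.9 * reference
toy_sdr = sdr_db(reference, estimate)
```

Under this definition, a higher value means the estimate is closer to the reference stem.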
Abstract
In recent years, deep neural network (DNN) based approaches have achieved state-of-the-art performance for music source separation (MSS). Although previous methods have addressed large receptive field modeling in various ways, the temporal and frequency correlations of the music spectrogram, which contains repeated patterns, have not been explicitly explored for the MSS task. In this paper, a temporal-frequency attention module is proposed to model the spectrogram correlations along both the temporal and frequency dimensions. Moreover, multi-scale attention is proposed to effectively capture the correlations in music signals. Experimental results on the MUSDB18 dataset show that the proposed method outperforms existing state-of-the-art systems, with a signal-to-distortion ratio (SDR) of 9.51 dB on separating the vocal stems, which is the primary practical application of MSS.
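The abstract does not specify the architecture, so the following is only a minimal NumPy sketch of the general idea: self-attention applied separately along the time axis and the frequency axis of a feature map, repeated at more than one resolution. Everything here is an assumption made for illustration, including the function names, the absence of learned query/key/value projections, the average pooling used to form coarser scales, and the additive fusion. It should not be read as the authors' module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(feat, axis):
    """Self-attention over one axis of a (channels, freq, time) map.

    axis=1 attends across frequency bins, axis=2 across time frames.
    No learned projections are used (illustrative assumption).
    """
    n = feat.shape[axis]
    # Each position along `axis` becomes one token of flattened features.
    x = np.moveaxis(feat, axis, 0).reshape(n, -1)
    scores = x @ x.T / np.sqrt(x.shape[1])   # (n, n) similarity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    out = weights @ x                        # weighted sum of tokens
    rest = [s for i, s in enumerate(feat.shape) if i != axis]
    out = out.reshape([n] + rest)
    return np.moveaxis(out, 0, axis), weights

def multi_scale_tf_attention(feat, scales=(1, 2)):
    """Hypothetical fusion: attend at each scale, upsample, average, add."""
    c, f, t = feat.shape
    acc = np.zeros_like(feat)
    for s in scales:
        # Average-pool both axes by s (f and t must be divisible by s).
        pooled = feat.reshape(c, f // s, s, t // s, s).mean(axis=(2, 4))
        att_t, _ = axis_attention(pooled, axis=2)
        att_f, _ = axis_attention(pooled, axis=1)
        combined = att_t + att_f
        # Nearest-neighbour upsampling back to the input resolution.
        acc += combined.repeat(s, axis=1).repeat(s, axis=2)
    return feat + acc / len(scales)          # residual connection

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 16))       # (channels, freq, time)
out = multi_scale_tf_attention(feat)
_, time_weights = axis_attention(feat, axis=2)
```

The time-axis weight matrix has one row per frame, so repeated patterns such as a recurring chorus would show up as large off-diagonal entries. The coarser scale compares pooled regions rather than individual frames or bins.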
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
