Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation
Adam Sorrenti

TL;DR
This paper presents a U-Net-based neural network approach for accurately separating singing voices from musical tracks using spectrogram analysis, achieving high SDR, SIR, and SAR scores on the MUSDB18 dataset.
Contribution
It introduces a novel application of U-Net with frequency normalization and MAE loss for vocal separation, outperforming previous methods.
Findings
Achieved the highest SDR, 7.1 dB, with frequency-axis Min/Max normalization and MAE loss.
Recorded SIR of 25.2 dB and SAR of 7.2 dB, surpassing other configurations.
Demonstrated the effectiveness of frequency normalization and MAE loss in vocal segmentation.
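To make the headline metric concrete, here is a minimal sketch of a simplified Source-to-Distortion Ratio in decibels. It assumes the plain energy-ratio form, 10 * log10(||s||^2 / ||s - s_hat||^2). Reported MUSDB18 scores normally come from the BSS Eval toolkit, which further decomposes the error into interference (SIR) and artifact (SAR) terms, so this function is illustrative only and is not the evaluation code used in the paper. The name `simple_sdr` and the `eps` parameter are hypothetical.

```python
import numpy as np

def simple_sdr(reference, estimate, eps=1e-12):
    """Simplified SDR in dB: reference energy over error energy.

    Assumption: this is the plain energy-ratio form, not the BSS Eval
    decomposition that yields separate SDR, SIR and SAR values.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    signal_energy = np.sum(reference ** 2)
    error_energy = np.sum((reference - estimate) ** 2)
    # eps guards against division by zero for a perfect estimate.
    return 10.0 * np.log10((signal_energy + eps) / (error_energy + eps))

# Toy check: an estimate that is 10% too quiet everywhere leaves an error
# with 1/100 of the reference energy, which is about 20 dB.
reference = np.ones(100)
estimate = 0.9 * reference
sdr_db = simple_sdr(reference, estimate)
```

Higher values mean the estimate carries less error energy relative to the target; under this simplified form, each 10 dB step corresponds to ten times less error energy.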
Abstract
Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study addresses the separation of vocal components from music spectrograms. We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed time-frequency spectrograms, using the benchmark MUSDB18 dataset for music separation. We then implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract the singing voice components accurately. We achieved noteworthy results in audio source separation using our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation.…
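The preprocessing and loss described in the abstract can be sketched roughly as follows. This is a NumPy-only illustration under several assumptions: "frequency-axis normalization with Min/Max scaling" is read here as scaling each frequency bin to [0, 1] across time, which is one plausible interpretation and may differ from the paper's definition. The window length, hop size, and all function names are hypothetical, and the U-Net itself is omitted.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=256):
    """Magnitude spectrogram of a mono signal, shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft over each windowed frame, then transpose to (frequency, time).
    return np.abs(np.fft.rfft(frames, axis=1)).T

def minmax_per_frequency(spec, eps=1e-8):
    """Scale each frequency bin (row) to [0, 1] across time.

    Assumption: this is one reading of "frequency-axis normalization".
    """
    lo = spec.min(axis=1, keepdims=True)
    hi = spec.max(axis=1, keepdims=True)
    return (spec - lo) / (hi - lo + eps)

def mae_loss(predicted, target):
    """Mean Absolute Error between two spectrograms of equal shape."""
    return float(np.mean(np.abs(predicted - target)))

# Synthetic stand-ins for a MUSDB18 excerpt: a 440 Hz tone plays the role
# of the vocal stem, low-level noise plays the role of the accompaniment.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
vocals = 0.5 * np.sin(2 * np.pi * 440.0 * t)
accompaniment = 0.1 * rng.standard_normal(t.shape)
mixture = vocals + accompaniment

mix_spec = minmax_per_frequency(stft_magnitude(mixture))
voc_spec = minmax_per_frequency(stft_magnitude(vocals))

# In training, the first argument would be the network's predicted vocal
# spectrogram; the normalized mixture is used here only as a placeholder.
loss = mae_loss(mix_spec, voc_spec)
```

In a full pipeline, the normalized mixture spectrogram would be the network input and the normalized vocal spectrogram the training target, with `mae_loss` comparing the network output against that target.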
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
