Musical Source Separation Bake-Off: Comparing Objective Metrics with Human Perception
Noah Jaffe, John Ashley Burgoyne

TL;DR
This study evaluates how well various objective metrics predict human perception of music source separation quality, revealing that no single metric is universally reliable and emphasizing the importance of stem-specific evaluation methods.
Contribution
The study provides a large-scale listener dataset and compares multiple objective metrics against it, highlighting their strengths and limitations in predicting human perception across different music stems.
Findings
SDR best predicts vocal quality
SI-SAR better correlates with perception for drums and bass
FAD with CLAP-LAION-music performs well for drums and bass
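FAD (Fréchet Audio Distance) compares two sets of embeddings by fitting a Gaussian to each set and measuring the Fréchet distance between the two Gaussians. The sketch below is a minimal NumPy illustration of that distance only. It assumes the embeddings have already been extracted by some model (CLAP-LAION-music or any other); the function name `frechet_distance` and the array shapes are hypothetical and do not come from the paper.

```python
import numpy as np

def frechet_distance(emb_ref, emb_est):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_est: arrays of shape (n_frames, dim) with dim >= 2.
    These are assumed to be precomputed embeddings; no model is run here.
    """
    emb_ref = np.asarray(emb_ref, dtype=np.float64)
    emb_est = np.asarray(emb_est, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each embedding set.
    mu_r, mu_e = emb_ref.mean(axis=0), emb_est.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_e = np.cov(emb_est, rowvar=False)

    # Tr((cov_r cov_e)^(1/2)) is the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # positive semi-definite covariances. Clipping guards against tiny
    # negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(cov_r @ cov_e)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_e) - 2.0 * tr_sqrt)

# Usage with synthetic stand-in embeddings (not real audio embeddings).
rng = np.random.default_rng(0)
reference_emb = rng.normal(size=(500, 8))
estimate_emb = reference_emb + 0.5  # same spread, shifted mean
print(frechet_distance(reference_emb, estimate_emb))
```

Lower values mean the two embedding distributions are closer. How the reference and estimate sets are formed per stem is a design choice that this sketch leaves open.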
Abstract
Music source separation aims to extract individual sound sources (e.g., vocals, drums, guitar) from a mixed music recording. However, evaluating the quality of separated audio remains challenging, as commonly used metrics like the source-to-distortion ratio (SDR) do not always align with human perception. In this study, we conducted a large-scale listener evaluation on the MUSDB18 test set, collecting approximately 30 ratings per track from seven distinct listener groups. We compared several objective metrics: energy-ratio measures, including legacy ones (BSSEval v4, SI-SDR variants), and embedding-based alternatives (Fréchet Audio Distance using CLAP-LAION-music, EnCodec, VGGish, Wav2Vec2, and HuBERT). While SDR remains the best-performing metric for vocal estimates, our results show that the scale-invariant signal-to-artifacts ratio (SI-SAR) better predicts listener ratings for drums and…
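To make the energy-ratio family concrete, here is a minimal NumPy sketch of a plain SDR and of the scale-invariant SDR (SI-SDR). It is an illustration under simplifying assumptions, not the BSSEval v4 implementation: BSSEval additionally allows a short distortion filter and decomposes the error into interference and artifact terms, which is what SAR-style metrics such as SI-SAR build on. The function names and the `eps` stabiliser are choices made for this example.

```python
import numpy as np

def sdr(reference, estimate, eps=1e-12):
    """Plain SDR in dB: reference energy over error energy (no filtering)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(error**2) + eps))

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB.

    The estimate is projected onto the reference, so a pure gain
    change in the estimate does not affect the score.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Optimal scaling of the reference that best matches the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))

# Usage on a synthetic mono signal (a stand-in for a separated stem).
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
reference = np.sin(2.0 * np.pi * 440.0 * t)
noisy = reference + 0.1 * np.random.default_rng(0).normal(size=t.size)
print(sdr(reference, noisy), si_sdr(reference, noisy))
print(sdr(reference, 0.5 * noisy), si_sdr(reference, 0.5 * noisy))
```

Halving the estimate's gain lowers the plain SDR but leaves SI-SDR unchanged, which is the motivation for the scale-invariant variants.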
Taxonomy
Topics: Music Technology and Sound Studies · Speech and Audio Processing · Music and Audio Processing
