TL;DR
This paper investigates the limitations of traditional evaluation metrics for generative singing voice separation models and proposes alternative metrics that better correlate with human perceptual quality.
Contribution
The study analyzes correlations between objective audio quality metrics and human opinion scores, identifying more reliable metrics for evaluating generative singing voice separation models.
Findings
Intrusive embedding-based metrics correlate better with human scores than conventional intrusive metrics such as BSS-Eval.
MSE on Music2Latent embeddings yields the highest correlation with human scores for discriminative models.
Multi-resolution STFT loss and MSE on MERT-L12 embeddings are most effective for generative models.
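To make the two kinds of metrics named in the findings concrete, here is a minimal NumPy sketch of an intrusive multi-resolution STFT distance and an MSE between embeddings. It is an illustration under stated assumptions, not the paper's implementation: the loss form (spectral convergence plus log-magnitude L1), the three FFT/hop resolutions, and the function names are all assumptions, and the embeddings are taken as precomputed arrays rather than extracted with Music2Latent or MERT.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop):
    """Magnitude STFT with a Hann window, shaped (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_stft_distance(
    estimate,
    reference,
    resolutions=((512, 128), (1024, 256), (2048, 512)),  # assumed (n_fft, hop) pairs
    eps=1e-8,
):
    """Average of spectral-convergence and log-magnitude terms over resolutions.

    Assumes both signals are 1-D, equal in length, and at least as long as
    the largest n_fft. Lower values mean the estimate is closer to the reference.
    """
    total = 0.0
    for n_fft, hop in resolutions:
        est = stft_magnitude(estimate, n_fft, hop)
        ref = stft_magnitude(reference, n_fft, hop)
        spectral_convergence = np.linalg.norm(ref - est) / (np.linalg.norm(ref) + eps)
        log_magnitude = np.mean(np.abs(np.log(ref + eps) - np.log(est + eps)))
        total += spectral_convergence + log_magnitude
    return total / len(resolutions)


def embedding_mse(estimate_embedding, reference_embedding):
    """Mean squared error between two precomputed embedding arrays of equal shape."""
    diff = np.asarray(estimate_embedding) - np.asarray(reference_embedding)
    return float(np.mean(diff ** 2))


# Toy signals standing in for a reference vocal and a separated estimate.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
reference = np.sin(2 * np.pi * 220.0 * t)
estimate = reference + 0.1 * rng.standard_normal(len(t))

distance_clean = multi_resolution_stft_distance(reference, reference)
distance_noisy = multi_resolution_stft_distance(estimate, reference)
```

Both functions are intrusive in the sense used above: each one needs the reference signal (or its embedding) alongside the separated output.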
Abstract
Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval. For…
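The correlation analysis described in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not say which correlation coefficients are used, so Pearson and Spearman are assumptions here, and the DMOS and metric values are made-up toy numbers, not results from the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson linear correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])


def ranks(values):
    """Ranks from 1..n (no tie handling, which is enough for this toy data)."""
    values = np.asarray(values, float)
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-system values, invented for illustration only.
# DMOS: higher means less perceived degradation.
# Metric: a distance to the reference, so lower means closer.
toy_dmos = [4.5, 3.8, 3.1, 2.2, 1.6]
toy_metric = [0.10, 0.25, 0.40, 0.70, 0.95]

pearson_r = pearson(toy_metric, toy_dmos)
spearman_rho = spearman(toy_metric, toy_dmos)
```

Because the toy metric is a distance, a useful metric shows a strong negative correlation with DMOS; when comparing metrics it is the magnitude of the correlation that matters.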
Taxonomy
Methods: Sparse Evolutionary Training
