Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models

Paul A. Bereuter; Benjamin Stahl; Mark D. Plumbley; Alois Sontacchi

arXiv:2507.11427 · eess.AS · November 19, 2025 · WASPAA

TL;DR

This paper investigates the limitations of traditional evaluation metrics such as BSS-Eval for generative singing voice separation models and identifies alternative metrics that correlate better with human ratings of perceptual quality.

Contribution

The study analyzes correlations between objective audio quality metrics and human opinion scores, identifying more reliable metrics for evaluating generative singing voice separation models.

Findings

01 Intrusive embedding-based metrics correlate better with human opinion scores than conventional intrusive metrics such as BSS-Eval.

02 MSE computed on Music2Latent embeddings yields the highest correlation with DMOS for discriminative models.

03 For generative models, the multi-resolution STFT loss and MSE on MERT-L12 embeddings correlate most strongly with DMOS.
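
The two metric families named in the findings can be sketched as follows. This is a minimal illustration, assuming a common formulation of the multi-resolution STFT loss (spectral convergence plus log-magnitude L1 distance, averaged over several FFT sizes). The resolutions, the Hann window, and all function names are assumptions made for this example, not the configuration used in the paper. Embedding extraction with Music2Latent or MERT is not shown, so `embedding_mse` expects precomputed embedding arrays.

```python
import numpy as np


def stft_mag(x, n_fft, hop):
    """Magnitude STFT using a Hann window and no padding (illustrative choice)."""
    x = np.asarray(x, dtype=float)
    if len(x) < n_fft:
        raise ValueError("signal is shorter than the FFT size")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_stft_loss(
    ref, est, resolutions=((512, 128), (1024, 256), (2048, 512)), eps=1e-8
):
    """Average of (spectral convergence + log-magnitude L1) over resolutions.

    The (n_fft, hop) pairs are assumed defaults, not values from the paper.
    """
    losses = []
    for n_fft, hop in resolutions:
        s_ref = stft_mag(ref, n_fft, hop)
        s_est = stft_mag(est, n_fft, hop)
        # Spectral convergence: relative Frobenius-norm error of the magnitudes.
        sc = np.linalg.norm(s_ref - s_est) / (np.linalg.norm(s_ref) + eps)
        # Mean absolute difference of the log magnitudes.
        log_mag = np.mean(np.abs(np.log(s_ref + eps) - np.log(s_est + eps)))
        losses.append(sc + log_mag)
    return float(np.mean(losses))


def embedding_mse(ref_emb, est_emb):
    """Mean squared error between two precomputed embedding arrays of equal shape."""
    ref_emb = np.asarray(ref_emb, dtype=float)
    est_emb = np.asarray(est_emb, dtype=float)
    if ref_emb.shape != est_emb.shape:
        raise ValueError("embeddings must have the same shape")
    return float(np.mean((ref_emb - est_emb) ** 2))
```

Both functions are intrusive in the sense used above: they need the reference vocal stem as well as the separated estimate, and a lower value indicates a closer match to the reference.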

Abstract

Traditional Blind Source Separation Evaluation (BSS-Eval) metrics were originally designed to evaluate linear audio source separation models based on methods such as time-frequency masking. However, recent generative models may introduce nonlinear relationships between the separated and reference signals, limiting the reliability of these metrics for objective evaluation. To address this issue, we conduct a Degradation Category Rating listening test and analyze correlations between the obtained degradation mean opinion scores (DMOS) and a set of objective audio quality metrics for the task of singing voice separation. We evaluate three state-of-the-art discriminative models and two new competitive generative models. For both discriminative and generative models, intrusive embedding-based metrics show higher correlations with DMOS than conventional intrusive metrics such as BSS-Eval. For…
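
The correlation analysis described in the abstract can be illustrated with a small standard-library sketch. The excerpt does not state which correlation coefficients are used, so the Pearson and Spearman coefficients below are an assumption, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, each separated excerpt contributes one DMOS value and one metric value, and the coefficient is computed across excerpts. For error-type metrics such as an embedding MSE, where lower is better, a strong relationship with DMOS shows up as a strongly negative coefficient, so metrics are best compared by absolute value.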

Code & Models

Repositories

pablebe/gensvs_eval
PyTorch · Official
