Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations
Paul A. Bereuter, Alois Sontacchi

TL;DR
This paper evaluates embedding-based intrusive metrics computed on MERT representations for musical source separation, and reports stronger correlation with perceptual quality ratings than traditional BSS-Eval metrics achieve.
Contribution
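It analyzes how well two MERT embedding-based intrusive metrics, a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD), correlate with perceptual audio quality ratings in MSS evaluation, and shows they outperform BSS-Eval metrics in that respect.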
Findings
MERT-based metrics correlate more strongly with perceptual ratings than BSS-Eval metrics do.
Experiments on two datasets confirm the effectiveness of embedding-based metrics.
Embedding-based metrics outperform traditional metrics across various models and stems.
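The comparison behind these findings comes down to correlating each metric's scores with listening-test ratings. The summary above does not say which correlation coefficient the authors use, so the sketch below assumes Spearman rank correlation. The ratings and metric values are made-up toy numbers, and `metric_a` and `metric_b` are hypothetical placeholders, not results from the paper.

```python
import numpy as np


def rankdata(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of values tied with sorted_vals[i].
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(rankdata(x), rankdata(y))[0, 1])


# Toy data (hypothetical): one perceptual rating per separated excerpt.
ratings = [80, 65, 40, 20, 55]
# A distance-like metric (lower is better) that orders the excerpts
# exactly opposite to the ratings.
metric_a = [0.1, 0.2, 0.5, 0.9, 0.3]
# A score-like metric (higher is better) that only partly agrees.
metric_b = [9.0, 4.0, 6.0, 3.0, 7.0]

rho_a = spearman(ratings, metric_a)
rho_b = spearman(ratings, metric_b)
```

For a distance-like metric, a strongly negative coefficient indicates good agreement with listeners, so magnitudes rather than signs are compared across metrics.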
Abstract
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics correlate poorly with perceptual audio quality ratings from listening tests, which are considered the gold-standard evaluation method. As an alternative, embedding-based intrusive metrics that leverage latent representations from large self-supervised audio models, such as Music undERstanding with large-scale self-supervised Training (MERT), have been introduced for singing voice separation. In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD), both calculated on MERT embeddings. Experiments on two independent datasets…
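The two metrics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the embeddings are already extracted as arrays of shape (frames, dims), that the MSE is taken over time-aligned frames, and that the intrusive FAD variant fits one Gaussian to the reference stem's frames and one to the separated estimate's frames. Random arrays stand in for MERT embeddings, and all function and variable names are hypothetical.

```python
import numpy as np


def embedding_mse(ref_emb, est_emb):
    """Mean squared error between time-aligned embeddings of shape (frames, dims)."""
    if ref_emb.shape != est_emb.shape:
        raise ValueError("reference and estimate embeddings must have the same shape")
    return float(np.mean((ref_emb - est_emb) ** 2))


def frechet_distance(ref_emb, est_emb):
    """Fréchet distance between Gaussians fitted to two sets of embedding frames."""
    mu_r, mu_e = ref_emb.mean(axis=0), est_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(ref_emb, rowvar=False))
    cov_e = np.atleast_2d(np.cov(est_emb, rowvar=False))
    # Tr(sqrt(cov_r @ cov_e)) is the sum of the square roots of the
    # eigenvalues of the product, which are real and non-negative for
    # covariance matrices (small numerical negatives are clipped to zero).
    eigvals = np.linalg.eigvals(cov_r @ cov_e)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    dist = np.sum((mu_r - mu_e) ** 2) + np.trace(cov_r) + np.trace(cov_e) - 2.0 * tr_sqrt
    return float(max(dist, 0.0))


# Stand-in data (assumption): random vectors in place of real MERT embeddings.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 8))
est_good = ref + 0.05 * rng.normal(size=ref.shape)  # mildly degraded estimate
est_bad = ref + 1.0 * rng.normal(size=ref.shape)    # heavily degraded estimate

mse_good, mse_bad = embedding_mse(ref, est_good), embedding_mse(ref, est_bad)
fd_good, fd_bad = frechet_distance(ref, est_good), frechet_distance(ref, est_bad)
```

Both functions are distances, so lower values mean the estimate's embeddings lie closer to the reference's. In this toy setup the heavily degraded estimate scores higher on both.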
