ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring

Ari Frummer; Helin Wang; Tianyu Cao; Adi Arbel; Yuval Sieradzki; Oren Gal; Jesús Villalba; Thomas Thebaud; Najim Dehak

arXiv:2510.21014 · eess.AS · October 28, 2025

TL;DR

This paper presents ReFESS-QI, a text-free, reference-free evaluation framework for speech separation that uses self-supervised learning (SSL) representations to jointly predict audio quality and intelligibility.

Contribution

It introduces an SSL-based method that estimates separation quality (SI-SNR) and intelligibility (WER) directly from the mixture and the separated tracks, so it can be applied to real-world mixtures for which no reference exists.

Findings

01

Achieves a WER estimation MAE of 17% and a PCC of 0.77 on the WHAMR! dataset.

02

Attains an SI-SNR estimation MAE of 1.38 and a PCC of 0.95, indicating close agreement with reference-based scores.

03

Demonstrates robustness across various SSL representations.
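
The findings above are reported as a mean absolute error (MAE) and a Pearson correlation coefficient (PCC) between estimated and reference-based scores. As a minimal sketch of how these two agreement statistics are conventionally computed, assuming paired lists of per-utterance scores, the following may help. The function names and the sample numbers are illustrative assumptions, not taken from the paper.

```python
import math


def mean_absolute_error(predicted, reference):
    """Average absolute gap between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def pearson_correlation(predicted, reference):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_r)


if __name__ == "__main__":
    # Made-up illustrative values, NOT results from the paper.
    predicted_wer = [0.10, 0.35, 0.20, 0.60]
    reference_wer = [0.05, 0.40, 0.25, 0.50]
    print("MAE:", mean_absolute_error(predicted_wer, reference_wer))
    print("PCC:", pearson_correlation(predicted_wer, reference_wer))
```

A low MAE says the estimates are close in absolute terms, while a high PCC says they rank and scale with the reference scores; the two can disagree, which is presumably why both are reported.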

Abstract

Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework utilizes the mixture and separated tracks to jointly predict audio quality, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. We conducted experiments on the WHAMR! dataset, which show a WER estimation with a mean absolute error (MAE) of 17% and a Pearson…
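
For readers unfamiliar with the two target metrics named in the abstract, the sketch below shows their conventional reference-based definitions, that is, the quantities a reference-free estimator of this kind would be trained to predict. It is not the paper's framework. The function names, the `eps` stabilizer, the zero-mean step, and whitespace tokenization are assumptions made for illustration.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    Both inputs are equal-length sequences of samples. Signals are made
    zero-mean first (a common convention, assumed here).
    """
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the target component.
    scale = sum(a * b for a, b in zip(e, r)) / (sum(b * b for b in r) + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def word_error_rate(reference_text, hypothesis_text):
    """WER = word-level edit distance divided by the reference word count."""
    ref = reference_text.split()
    hyp = hypothesis_text.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy inputs for illustration only.
    clean = [1.0, -1.0, 2.0, -2.0]
    separated = [1.1, -0.9, 2.2, -1.8]
    print("SI-SNR (dB):", si_snr(separated, clean))
    print("WER:", word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Both functions need ground truth (a clean reference signal or a reference transcript), which is exactly what is unavailable for real-world mixtures and what motivates predicting these scores from the mixture and separated tracks instead.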

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing