ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring
Ari Frummer, Helin Wang, Tianyu Cao, Adi Arbel, Yuval Sieradzki, Oren Gal, Jesús Villalba, Thomas Thebaud, Najim Dehak

TL;DR
This paper presents ReFESS-QI, a reference-free evaluation framework for speech separation, built on self-supervised learning (SSL) representations, that jointly predicts audio quality and intelligibility.
Contribution
It introduces an SSL-based method that estimates audio quality (SI-SNR) and intelligibility (WER) directly from the mixture and the separated tracks, so it applies to real-world mixtures for which no reference exists.
Findings
WER estimation on the WHAMR! dataset: mean absolute error (MAE) of 17% and Pearson correlation coefficient (PCC) of 0.77.
SI-SNR estimation: MAE of 1.38 and PCC of 0.95.
Demonstrates robustness across various SSL representations.
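The two figures of merit quoted above, MAE and PCC, compare the framework's predicted scores with the reference-based scores. A minimal sketch of how they are commonly computed is given below; the function names and the sample values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mae(predicted, target):
    """Mean absolute error between predicted and reference-based scores."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(predicted - target)))

def pcc(predicted, target):
    """Pearson correlation coefficient between two score sequences."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    p = predicted - predicted.mean()
    t = target - target.mean()
    return float(np.sum(p * t) / np.sqrt(np.sum(p ** 2) * np.sum(t ** 2)))

# Made-up illustrative values (WER as fractions), NOT data from the paper.
true_wer = [0.10, 0.35, 0.50, 0.80]
est_wer = [0.15, 0.30, 0.60, 0.70]

print("MAE:", mae(est_wer, true_wer))
print("PCC:", pcc(est_wer, true_wer))
```

A low MAE means the predicted values are close to the true ones on average, while a high PCC means the predictions rank and scale with the true values, which is why both are typically reported together.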
Abstract
Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and the corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework uses the mixture and the separated tracks to jointly predict audio quality, through the Scale-Invariant Signal-to-Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. Experiments on the WHAMR! dataset show a WER estimation with a mean absolute error (MAE) of 17% and a Pearson…
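For context, the conventional SI-SNR that the framework aims to estimate is computed from a separated track and its matched reference. The sketch below follows the commonly used definition (zero-mean signals, projection of the estimate onto the reference); it is not the paper's predictor, and the function name `si_snr` and the `eps` stabilizer are assumptions made for illustration.

```python
import numpy as np

def si_snr(estimate, reference, eps=1e-8):
    """Reference-based SI-SNR in dB (common definition, assumed here)."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    return float(10.0 * np.log10(ratio))

# Synthetic signals for illustration only.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 8000)
reference = np.sin(2 * np.pi * 220.0 * t)
estimate = reference + 0.1 * rng.standard_normal(t.shape)

print("SI-SNR (dB):", si_snr(estimate, reference))
# Rescaling the estimate leaves the score essentially unchanged.
print("SI-SNR of 2x estimate (dB):", si_snr(2.0 * estimate, reference))
```

Because this computation needs `reference`, it cannot be applied to real-world mixtures, which is the gap a reference-free estimator is meant to fill.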
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
