REAL-M: Towards Speech Separation on Real Mixtures
Cem Subakan, Mirco Ravanelli, Samuele Cornell, François Grondin

TL;DR
This paper introduces the REAL-M dataset of real-world speech mixtures and proposes a neural estimator for evaluating separation performance without ground truth, demonstrating its reliability and correlation with human judgment.
Contribution
The paper releases a new real-life speech mixture dataset and develops a blind neural SI-SNR estimator for performance evaluation without ground truth.
Findings
The SI-SNR estimator reliably evaluates real mixture separation performance.
The estimator's predictions correlate well with human opinions.
Performance trends on REAL-M match those on synthetic benchmarks.
Abstract
In recent years, deep-learning-based source separation has achieved impressive results. Most studies, however, still evaluate separation models on synthetic datasets, while the performance of state-of-the-art techniques on in-the-wild speech data remains an open question. This paper contributes to filling this gap in two ways. First, we release the REAL-M dataset, a crowd-sourced corpus of real-life mixtures. Second, we address the problem of performance evaluation on real-life mixtures, where the ground truth is not available. We bypass this issue by carefully designing a blind Scale-Invariant Signal-to-Noise Ratio (SI-SNR) neural estimator. Through a user study, we show that our estimator reliably evaluates the separation performance on real mixtures. The performance predictions of the SI-SNR estimator indeed correlate well with human opinions. Moreover, we observe that the performance…
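For readers unfamiliar with the metric, the sketch below computes the conventional reference-based SI-SNR in NumPy. It is not the paper's blind neural estimator, which is described as predicting separation quality without access to a clean reference; it only illustrates the quantity such an estimator would presumably approximate. The function name `si_snr` and the `eps` stabilizer are illustrative assumptions, not names from the paper.

```python
import numpy as np


def si_snr(estimate, target, eps=1e-8):
    """Reference-based SI-SNR in dB (hypothetical helper, not the paper's code).

    Both inputs are 1-D signals of equal length. `eps` is an assumed small
    constant that guards against division by zero.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Remove the mean so a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    target = target - target.mean()

    # Project the estimate onto the target; the residual is treated as noise.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = scale * target
    noise = estimate - projection

    ratio = (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    return 10.0 * np.log10(ratio)


# Usage: a clean tone corrupted by two levels of Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(clean.shape)

mild = si_snr(clean + 0.1 * noise, clean)
heavy = si_snr(clean + 0.5 * noise, clean)
rescaled = si_snr(2.0 * (clean + 0.1 * noise), clean)
```

Here `mild` should exceed `heavy`, and `rescaled` should match `mild` up to numerical precision, which is the scale invariance the metric is named for. On real recordings such as those in REAL-M no `target` exists, which is why a blind estimator is needed in the first place.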
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
