Towards Solving Cocktail-Party: The First Method to Build a Realistic Dataset with Ground Truths for Speech Separation

Rawad Melhem; Assef Jafar; Oumayma Al Dakkak

arXiv:2305.15758 · cs.SD · August 29, 2024 · 1 citation

TL;DR

This paper introduces a novel method for creating a realistic speech separation dataset with ground truths by recording two speakers simultaneously, enabling better training and benchmarking of deep learning models in real-world scenarios.

Contribution

The paper presents a new approach to generate realistic speech separation datasets with ground truths, addressing the challenge of obtaining accurate source signals in natural environments.

Findings

01 Improved SI-SDR by 1.65 dB when using the new dataset

02 Improved PESQ by approximately 0.5

03 Increased model stability across different microphone-to-speaker distances
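For readers unfamiliar with the first metric, the sketch below computes scale-invariant signal-to-distortion ratio (SI-SDR) in its common zero-mean form: the reference is optimally rescaled onto the estimate, and the energy of that projection is compared with the residual. It is a generic illustration assuming NumPy and single-channel signals of equal length; the function name `si_sdr` and the `eps` guard are illustrative choices, not taken from the paper.

```python
import numpy as np


def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB (zero-mean variant); illustrative sketch."""
    # Remove the mean so a constant offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling factor that projects the estimate onto the reference.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference   # part of the estimate explained by the reference
    residual = estimate - target  # everything else counts as distortion
    return float(10 * np.log10((np.dot(target, target) + eps) /
                               (np.dot(residual, residual) + eps)))
```

Because the reference is rescaled before comparison, multiplying the estimate by a constant leaves the score unchanged, which is why SI-SDR is preferred over plain SNR for separation outputs whose gain is arbitrary. The 1.65 dB figure above is a difference in this kind of score; how the paper aggregates it over test mixtures is not stated in this summary.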

Abstract

Speech separation is very important in real-world applications such as human-machine interaction, hearing aid devices, and automatic meeting transcription. In recent years, deep learning has brought significant improvements to the problem. Much of the attention has gone to supervised learning methods trained on datasets of synthetic mixtures, even though such mixtures are not representative of real-world ones. The difficulty of building a realistic dataset led researchers to unsupervised learning methods, because they can handle realistic mixtures directly. However, the results of unsupervised learning methods are still unconvincing. In this paper, a method is introduced to create a realistic dataset with ground-truth sources for speech separation. The main challenge in designing a realistic dataset is the unavailability of ground truths for the speakers' signals. To address…
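To make the contrast with synthetic data concrete, the sketch below shows one common way a synthetic two-speaker mixture is built: two separately recorded clean utterances are trimmed to equal length, one is rescaled to reach a chosen signal-to-interference ratio, and the two are summed. The scaled utterances are the ground truths by construction. This is a generic illustration assuming NumPy, not the paper's recording procedure; the function `make_synthetic_mixture` and its `sir_db` parameter are hypothetical names.

```python
import numpy as np


def make_synthetic_mixture(s1: np.ndarray, s2: np.ndarray, sir_db: float = 0.0):
    """Hypothetical helper: mix two clean utterances at a target SIR in dB.

    Returns the mixture and a (2, n) array of the scaled sources,
    which serve as ground truths because the mixture is their exact sum.
    """
    n = min(len(s1), len(s2))            # trim both utterances to a common length
    s1, s2 = s1[:n], s2[:n]
    p1 = np.mean(s1 ** 2)                # average power of each source
    p2 = np.mean(s2 ** 2)
    # Gain on s2 so that 10 * log10(p1 / (gain**2 * p2)) == sir_db.
    gain = np.sqrt(p1 / (p2 * 10 ** (sir_db / 10)))
    s2 = gain * s2
    mixture = s1 + s2                    # purely additive: no room acoustics, no overlap effects
    return mixture, np.stack([s1, s2])
```

The purely additive model is what makes ground truths trivial to obtain, and also why, as the abstract notes, such mixtures are not representative of speakers recorded together in a real room.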

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing