Cross-Talk Speech Reduction, by Separation, for Separation
Zhong-Qiu Wang, Samuele Cornell

TL;DR
This paper introduces CTRnet and PuLSS, novel neural methods for reducing cross-talk and separating speech in real conversational recordings, improving speech recognition performance in real-world scenarios.
Contribution
The paper presents a new framework for training speech separation models directly on real data using cross-talk reduction and pseudo-labels, addressing domain generalization issues.
Findings
Achieves state-of-the-art ASR performance on the CHiME-6 dataset.
Outperforms previous methods on real conversational speech data.
First neural separation method to surpass guided source separation in real-world conditions.
Abstract
In conversational speech separation and recognition tasks, close-talk microphones are typically attached to each speaker during training data collection to capture near-field, close-talk mixture signals, in addition to using far-field microphones to record far-field mixture signals. Each such close-talk mixture exhibits a reasonably high energy level for the wearer and could intuitively serve as weak supervision for training far-field speech separation models directly on real-recorded far-field signals. However, they are not sufficiently clean for this purpose, as they often contain strong cross-talk speech from other speakers in addition to background noise. To address this, we propose cross-talk reduction (CTR), a task aiming to isolate the wearer's speech from each close-talk mixture, and a novel method called CTRnet, which can be trained directly on real-recorded pairs of close-talk…
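To make the abstract's motivation concrete, here is a toy numerical sketch of why a close-talk mixture is informative about its wearer yet not clean enough to use as a target. It is not CTRnet or PuLSS. The instantaneous mixing model, the function names `close_talk_mixtures` and `wearer_to_rest_db`, the `cross_gain` and `noise_std` values, and the white-noise stand-ins for speech are all assumptions made for illustration; real recordings also involve reverberation, propagation delays, and microphone mismatch.

```python
import numpy as np


def close_talk_mixtures(sources, cross_gain, noise_std, rng):
    """Toy close-talk model (hypothetical, not from the paper).

    Microphone c picks up its wearer (source c) at unit gain and every
    other speaker at `cross_gain`, plus white background noise.
    `sources` has shape (num_speakers, num_samples).
    """
    num_spk, num_samples = sources.shape
    gains = np.full((num_spk, num_spk), cross_gain)
    np.fill_diagonal(gains, 1.0)  # the wearer is the loudest source at their own mic
    noise = noise_std * rng.standard_normal((num_spk, num_samples))
    return gains @ sources + noise


def wearer_to_rest_db(mixture, wearer):
    """Energy ratio in dB of the wearer's signal to everything else in the mixture."""
    residual = mixture - wearer  # cross-talk plus noise
    return 10.0 * np.log10(np.sum(wearer ** 2) / np.sum(residual ** 2))


rng = np.random.default_rng(0)
sources = rng.standard_normal((2, 16000))  # white-noise stand-ins for two speakers
mixtures = close_talk_mixtures(sources, cross_gain=0.3, noise_std=0.05, rng=rng)

for c in range(sources.shape[0]):
    ratio = wearer_to_rest_db(mixtures[c], sources[c])
    print(f"mic {c}: wearer-to-rest ratio = {ratio:.1f} dB")
```

Under these assumed gains the wearer dominates each close-talk mixture, but the cross-talk and noise remain far from negligible, which is the gap that cross-talk reduction is meant to close before the close-talk signals are used to supervise far-field separation.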
