Robustness of Speech Separation Models for Similar-pitch Speakers
Bunlong Lay, Sebastian Zaczek, Kristina Tesch, Timo Gerkmann

TL;DR
This paper evaluates how well recent neural speech separation models perform when the speakers have similar pitches. Modern models handle matched training and testing conditions well but still struggle with unseen, similar-pitch scenarios, which points to areas for future improvement.
Contribution
The study extends the analysis of speech separation robustness under similar-pitch conditions to recent neural network models, and identifies persistent challenges and gaps in generalizability.
Findings
Modern models substantially reduce the performance gap under matched training and testing conditions.
Under mismatched conditions, performance drops significantly for similar-pitch speakers.
Performance remains strong for large pitch differences but weakens as the speakers' pitches become similar.
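One way to look at findings of this kind is to group test mixtures by the pitch gap between their two speakers and average a separation score within each group. The following is a minimal sketch under stated assumptions, not the authors' evaluation protocol: it assumes a mean fundamental frequency in Hz has already been estimated for each speaker and that some per-mixture separation score is available. The function names, the semitone measure, and the bin edges are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def semitone_gap(f0_a_hz, f0_b_hz):
    """Absolute pitch difference between two speakers, in semitones."""
    return abs(12.0 * math.log2(f0_a_hz / f0_b_hz))


def mean_score_by_gap(records, edges):
    """Average a separation score within bins of speaker pitch gap.

    records: iterable of (f0_a_hz, f0_b_hz, score) tuples (hypothetical layout).
    edges:   ascending lower bin edges in semitones, starting at 0;
             the last bin is open-ended.
    Returns {lower_edge: mean_score} for the non-empty bins.
    """
    grouped = defaultdict(list)
    for f0_a, f0_b, score in records:
        gap = semitone_gap(f0_a, f0_b)
        # Assign the mixture to the bin with the largest lower edge <= gap.
        lower = max(e for e in edges if e <= gap)
        grouped[lower].append(score)
    return {lower: sum(v) / len(v) for lower, v in sorted(grouped.items())}
```

Comparing the per-bin averages from a matched and a mismatched test set would then show whether the low-gap bins are where performance falls off.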
Abstract
Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art neural network models in scenarios where the pitch differences between speakers are minimal. Building on earlier findings by Ditter and Gerkmann, which identified a significant performance drop for the 2018 Chimera++ model under similar-pitch conditions, our study extends the analysis to more recent and sophisticated neural network models. Our experiments reveal that modern models have substantially reduced the performance gap for matched training and testing conditions. However, a substantial performance gap persists under mismatched conditions: models perform well for large pitch differences but worse when the speakers' pitches are similar. These findings motivate further research into…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
