An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
Rahil Parikh, Gaspar Rochette, Carol Espy-Wilson, Shihab Shamma

TL;DR
This paper investigates the mechanisms of end-to-end speech segregation models, revealing their reliance on harmonic cues, their instability under certain conditions, and the impact of encoder design on their performance and robustness.
Contribution
It provides a detailed analysis of how ConvTasnet and DPT-Net perform harmonic analysis, identifies sources of errors, and shows that replacing the encoder with a spectrogram improves stability at some cost in performance.
Findings
End-to-end models are highly unstable under imperceptible deformations.
Replacing the encoder with a spectrogram reduces performance but increases stability.
Harmonic cues are critical for speech segregation in these models.
Abstract
End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Knowing that harmonicity is a critical cue for these networks to group sources, in this work, we perform a thorough investigation of ConvTasnet and DPT-Net to analyze how they perform a harmonic analysis of the input mixture. We perform ablation studies where we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically analyze the harmonics most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities in synthetic mixtures. We find that end-to-end networks are highly unstable, and perform poorly when…
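The filter ablations mentioned in the abstract can be illustrated on a toy signal. The sketch below is not the authors' pipeline: it builds a synthetic harmonic complex and removes harmonics with idealized low-pass, high-pass, and band-stop selections applied at synthesis time, rather than with real filters on speech mixtures. The function names, the fundamental of 200 Hz, the sample rate, and the cutoffs are all assumptions chosen for illustration.

```python
import math

def low_pass(cutoff_hz):
    """Keep harmonics at or below the cutoff (idealized, hypothetical helper)."""
    return lambda f: f <= cutoff_hz

def high_pass(cutoff_hz):
    """Keep harmonics at or above the cutoff (idealized, hypothetical helper)."""
    return lambda f: f >= cutoff_hz

def band_stop(lo_hz, hi_hz):
    """Drop harmonics inside [lo_hz, hi_hz], keep everything else."""
    return lambda f: not (lo_hz <= f <= hi_hz)

def harmonic_complex(f0, n_harmonics, keep, sr=8000, n_samples=400):
    """Sum unit-amplitude sinusoids at k*f0 for each harmonic that `keep` accepts.

    Returns the list of retained harmonic frequencies and the waveform samples.
    """
    freqs = [k * f0 for k in range(1, n_harmonics + 1) if keep(k * f0)]
    samples = [
        sum(math.sin(2 * math.pi * f * t / sr) for f in freqs)
        for t in range(n_samples)
    ]
    return freqs, samples

def mean_power(samples):
    """Average squared amplitude of the waveform."""
    return sum(s * s for s in samples) / len(samples)

if __name__ == "__main__":
    # Assumed toy setting: 200 Hz fundamental with 10 harmonics (200 Hz to 2 kHz).
    conditions = {
        "low-pass 1 kHz": low_pass(1000),
        "high-pass 1 kHz": high_pass(1000),
        "band-stop 600-1200 Hz": band_stop(600, 1200),
    }
    for name, keep in conditions.items():
        freqs, samples = harmonic_complex(200, 10, keep)
        print(name, freqs, round(mean_power(samples), 3))
```

In an actual ablation, each filtered mixture would be passed to the separation model and the separation quality compared across conditions; this sketch only shows which harmonics survive each condition.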
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Acoustic Wave Phenomena Research
Methods: Convolutional time-domain audio separation network
