An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models

Rahil Parikh; Gaspar Rochette; Carol Espy-Wilson; Shihab Shamma

arXiv:2206.09556·eess.AS·June 22, 2022

TL;DR

This paper investigates the mechanisms of end-to-end speech segregation models, revealing their reliance on harmonic cues, their instability under certain conditions, and the impact of encoder design on their performance and robustness.

Contribution

The paper provides a detailed analysis of how ConvTasnet and DPT-Net perform harmonic analysis, identifies sources of error, and suggests that replacing the learned encoder with a spectrogram improves stability.

Findings

1. End-to-end models are highly unstable under imperceptible deformations of the input.

2. Replacing the encoder with a spectrogram reduces performance but increases stability.

3. Harmonic cues are critical for speech segregation in these models.

Abstract

End-to-end learning models have demonstrated a remarkable capability in performing speech segregation. Despite their wide scope of real-world applications, little is known about the mechanisms they employ to group and consequently segregate individual speakers. Knowing that harmonicity is a critical cue for these networks to group sources, in this work, we perform a thorough investigation on ConvTasnet and DPT-Net to analyze how they perform a harmonic analysis of the input mixture. We perform ablation studies where we apply low-pass, high-pass, and band-stop filters of varying pass-bands to empirically analyze the harmonics most critical for segregation. We also investigate how these networks decide which output channel to assign to an estimated source by introducing discontinuities in synthetic mixtures. We find that end-to-end networks are highly unstable, and perform poorly when…
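The filtering ablation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code: the filter design used in the paper is not given here, so the example assumes a simple brick-wall filter that zeroes FFT bins, applied to a synthetic harmonic tone. The names `fft_filter`, `harmonic_tone`, and `band_energy` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def fft_filter(signal, sample_rate, mode, low_hz=None, high_hz=None):
    """Brick-wall filter (an assumption, not the paper's filter design).

    mode="lowpass"  keeps frequencies <= high_hz
    mode="highpass" keeps frequencies >= low_hz
    mode="bandstop" removes frequencies in [low_hz, high_hz]
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    if mode == "lowpass":
        keep = freqs <= high_hz
    elif mode == "highpass":
        keep = freqs >= low_hz
    elif mode == "bandstop":
        keep = (freqs < low_hz) | (freqs > high_hz)
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Zero the rejected bins and return to the time domain.
    return np.fft.irfft(spectrum * keep, n=len(signal))


def harmonic_tone(f0, n_harmonics, sample_rate, duration):
    """Sum of equal-amplitude sinusoids at f0, 2*f0, ..., n_harmonics*f0."""
    t = np.arange(int(sample_rate * duration)) / sample_rate
    return sum(np.sin(2 * np.pi * f0 * k * t) for k in range(1, n_harmonics + 1))


def band_energy(signal, sample_rate, low_hz, high_hz):
    """Spectral energy of the signal inside [low_hz, high_hz]."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return power[(freqs >= low_hz) & (freqs <= high_hz)].sum()


# Remove harmonics 3 to 5 (600, 800, 1000 Hz) of a 200 Hz synthetic tone.
tone = harmonic_tone(f0=200, n_harmonics=10, sample_rate=8000, duration=1.0)
ablated = fft_filter(tone, 8000, "bandstop", low_hz=500, high_hz=1100)
```

In an ablation of this kind, the filtered mixture would be fed to the separation model and the change in separation quality compared across stop-bands. The model and the metric are outside the scope of this sketch.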

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Acoustic Wave Phenomena Research

Methods: Convolutional time-domain audio separation network