Harmonicity Plays a Critical Role in DNN Based Versus in Biologically-Inspired Monaural Speech Segregation Systems

Rahil Parikh (1), Ilya Kavalerov (2), Carol Espy-Wilson (1), Shihab Shamma (1) ((1) Institute for Systems Research, University of Maryland; (2) Google Inc.)

arXiv:2203.04420 · eess.AS · March 10, 2022 · 1 cites

Open Access

TL;DR

This paper investigates how harmonicity influences deep neural network models for speech segregation, revealing their high sensitivity to inharmonic speech and contrasting their mechanisms with biologically inspired algorithms.

Contribution

The paper demonstrates the critical role of harmonicity in DNN-based speech segregation models and highlights their vulnerability to inharmonic speech, contrasting them with biologically inspired methods.

Findings

01. Segregation performance drops sharply when one source is even slightly harmonically jittered.

02. Training on inharmonic speech does not remedy this sensitivity and worsens performance on natural speech.

03. DNN models rely on different segregation mechanisms than biologically inspired algorithms.

Abstract

Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art deep neural network (DNN)-based models: Conv-TasNet and DPT-Net. We evaluate their performance with mixtures of natural speech versus slightly manipulated inharmonic speech, in which the harmonics are slightly jittered in frequency. We find that performance deteriorates significantly if one source is even slightly harmonically jittered; e.g., an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity, instead resulting in worse performance on natural speech…
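To make the notion of harmonic jitter concrete, the following is a toy sketch in Python with NumPy; it is not the authors' pipeline, which operates on real speech. It builds a synthetic harmonic complex and then offsets each harmonic by a random amount. The function names, the 200 Hz fundamental, and the choice to draw each offset uniformly within plus or minus a fraction of f0 are all assumptions made for illustration; the abstract does not specify how the jitter percentage is defined.

```python
import numpy as np

def jittered_harmonics(f0, n_harmonics, jitter, rng):
    """Return n_harmonics component frequencies in Hz.

    Assumption for illustration: harmonic k sits at k * f0 plus an
    offset drawn uniformly from [-jitter * f0, +jitter * f0].
    jitter = 0.0 gives a perfectly harmonic series.
    """
    k = np.arange(1, n_harmonics + 1)
    offsets = rng.uniform(-jitter, jitter, size=n_harmonics) * f0
    return k * f0 + offsets

def synthesize(freqs, duration=0.5, sr=16000):
    """Sum equal-amplitude sinusoids at the given frequencies."""
    t = np.arange(int(duration * sr)) / sr
    components = np.sin(2 * np.pi * freqs[:, None] * t[None, :])
    return components.sum(axis=0) / len(freqs)

rng = np.random.default_rng(0)
f0 = 200.0  # hypothetical fundamental frequency in Hz

harmonic = jittered_harmonics(f0, 10, jitter=0.0, rng=rng)
inharmonic = jittered_harmonics(f0, 10, jitter=0.03, rng=rng)

# Largest deviation of any component from its harmonic position, in Hz.
max_shift_hz = np.max(np.abs(inharmonic - harmonic))

harmonic_tone = synthesize(harmonic)
inharmonic_tone = synthesize(inharmonic)
```

Under this assumed definition, a 3% jitter on a 200 Hz fundamental moves each component by at most 6 Hz, which gives a sense of how small the manipulation is relative to the spacing between harmonics.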

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing