# Analysis of Deep Clustering as Preprocessing for Automatic Speech Recognition of Sparsely Overlapping Speech

**Authors:** Tobias Menne, Ilya Sklyar, Ralf Schlüter, Hermann Ney

arXiv: 1905.03500 · 2019-09-26

## TL;DR

This paper evaluates deep clustering (DPCL) as a preprocessing step for automatic speech recognition (ASR) of sparsely overlapping speech. It proposes a data simulation method for generating such speech and analyzes the obstacles DPCL faces in that setting.

## Contribution

The paper introduces a data simulation approach, closely related to the wsj0-2mix dataset, that generates sparsely overlapping speech at an arbitrary overlap ratio. It then analyzes DPCL as an ASR preprocessing step in this more realistic scenario.
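To make the idea of an adjustable overlap ratio concrete, here is a minimal sketch of mixing two utterances so that they overlap for only part of the mixture. This is not the authors' simulation procedure, which this summary does not describe. The function name `mix_with_overlap`, the definition of the overlap ratio (overlapped samples divided by total mixture length), and the assumption that the second speaker starts later and finishes no earlier than the first are all illustrative choices. Plain Python lists stand in for audio samples to keep the example dependency-free.

```python
def mix_with_overlap(s1, s2, overlap_ratio):
    """Mix two utterances so they overlap for a given share of the mixture.

    Assumptions (not taken from the paper): overlap_ratio is the number of
    samples where both speakers are active divided by the total mixture
    length, speaker 2 starts after speaker 1, and speaker 2 ends no earlier
    than speaker 1.
    """
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both utterances must be non-empty")

    # Largest ratio reachable under the start/end assumptions above.
    max_ratio = min(n1 / n2, n2 / n1)
    if not 0.0 <= overlap_ratio <= max_ratio:
        raise ValueError(f"overlap_ratio must lie in [0, {max_ratio:.3f}]")

    # Solve (n1 - offset) / (offset + n2) = overlap_ratio for the start offset
    # of speaker 2, then round to a whole number of samples.
    offset = int(round((n1 - overlap_ratio * n2) / (1.0 + overlap_ratio)))

    mixture = [0.0] * max(n1, offset + n2)
    for i, x in enumerate(s1):
        mixture[i] += x
    for i, x in enumerate(s2):
        mixture[offset + i] += x
    return mixture, offset


# Toy usage: constant "signals" make the overlapped region easy to count.
speaker1 = [1.0] * 120
speaker2 = [1.0] * 120
mixture, offset = mix_with_overlap(speaker1, speaker2, 0.2)
overlapped = sum(1 for x in mixture if x == 2.0)
print(offset, len(mixture), overlapped / len(mixture))
```

Because the offset is rounded to whole samples, the achieved ratio can differ slightly from the requested one. A real pipeline would also need to handle signal scaling and utterance selection, neither of which is modeled here.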

## Key findings

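- Combining DPCL with a state-of-the-art hybrid acoustic model yields a 16.5% WER on the wsj0-2mix dataset, which the authors report as the best result published at the time, to their knowledge.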
- Analysis highlights obstacles of applying DPCL to sparsely overlapping speech.
- Proposes a new dataset simulation method for realistic speech overlap scenarios.
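
For readers unfamiliar with the metric in the first finding, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below implements that standard definition. The function name `word_error_rate` is hypothetical, the example sentences are invented, and the code does not reproduce the scoring setup used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy usage: one substitution ("sat" -> "sit") and one deletion ("the").
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, so it is an error rate per reference word and not a bounded percentage.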

## Abstract

Significant performance degradation of automatic speech recognition (ASR) systems is observed when the audio signal contains cross-talk. One of the recently proposed approaches to solve the problem of multi-speaker ASR is the deep clustering (DPCL) approach. Combining DPCL with a state-of-the-art hybrid acoustic model, we obtain a word error rate (WER) of 16.5 % on the commonly used wsj0-2mix dataset, which is the best performance reported thus far to the best of our knowledge. The wsj0-2mix dataset contains simulated cross-talk where the speech of multiple speakers overlaps for almost the entire utterance. In a more realistic ASR scenario the audio signal contains significant portions of single-speaker speech and only part of the signal contains speech of multiple competing speakers. This paper investigates obstacles of applying DPCL as a preprocessing method for ASR in such a scenario of sparsely overlapping speech. To this end we present a data simulation approach, closely related to the wsj0-2mix dataset, generating sparsely overlapping speech datasets of arbitrary overlap ratio. The analysis of applying DPCL to sparsely overlapping speech is an important interim step between the fully overlapping datasets like wsj0-2mix and more realistic ASR datasets, such as CHiME-5 or AMI.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.03500/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/1905.03500/full.md

## References

17 references — full list in the complete paper: https://tomesphere.com/paper/1905.03500/full.md

---
Source: https://tomesphere.com/paper/1905.03500