A Semi-spontaneous Dutch Speech Dataset for Speech Enhancement and Speech Recognition
Dimme de Groot, Yuanyuan Zhang, Jorge Martinez, Odette Scharenborg

TL;DR
This paper introduces DRES, a semi-spontaneous Dutch speech dataset recorded in noisy indoor environments, and evaluates the impact of speech enhancement on recognition performance using state-of-the-art models.
Contribution
The creation of DRES, a realistic Dutch speech dataset, and the comprehensive evaluation of speech enhancement and recognition models in real-world noisy conditions.
Findings
Five of the eight evaluated ASR models achieved WERs below 22% on DRES, despite the challenging recording conditions.
Modern single-channel speech enhancement did not improve ASR performance in realistic scenarios.
Evaluation in real-world conditions is crucial for assessing speech processing models.
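The WER figures above are word error rates: the number of word-level substitutions, deletions, and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that metric, not the authors' scoring pipeline. The function name `word_error_rate` and the Dutch sample sentences are illustrative assumptions, and any text normalization the paper may have applied (casing, punctuation) is not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for illustration (not taken from DRES).
reference = "de trein vertrekt om tien uur"
hypothesis = "de trein vertrek om tien"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

Comparing the WER of a recognizer on the raw recordings with its WER on the enhanced recordings is the kind of before/after comparison the findings refer to; a lower value after enhancement would indicate a benefit.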
Abstract
We present DRES: a 1.5-hour Dutch realistic elicited (semi-spontaneous) speech dataset from 80 speakers recorded in noisy, public indoor environments. DRES was designed as a test set for the evaluation of state-of-the-art (SOTA) automatic speech recognition (ASR) and speech enhancement (SE) models in a real-world scenario: a person speaking in a public indoor space with background talkers and noise. The speech was recorded with a four-channel linear microphone array. In this work we evaluate the speech quality of five well-known single-channel SE algorithms and the recognition performance of eight SOTA off-the-shelf ASR models before and after applying SE on the speech of DRES. We found that five out of the eight ASR models have WERs lower than 22% on DRES, despite the challenging conditions. In contrast to recent work, we did not find a positive effect of modern single-channel SE on…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Face recognition and analysis
