Using Data Augmentations and VTLN to Reduce Bias in Dutch End-to-End Speech Recognition Systems

Tanvina Patel; Odette Scharenborg

arXiv:2307.02009 · cs.CL · July 6, 2023 · 1 citation

Open Access

TL;DR

This study uses data augmentation and VTLN to mitigate bias in Dutch end-to-end speech recognition, reducing the average WER by 6.9% and bias by 3.9% across diverse speaker groups. The VTLN model trained on Dutch also improves recognition of Mandarin Chinese child speech.

Contribution

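The paper combines data augmentation (speed perturbation and spectral augmentation) with VTLN to reduce bias in an end-to-end speech recognition system, and shows that the combination helps across different age groups and non-native speakers of Dutch, with the VTLN model carrying over to Mandarin Chinese child speech.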

Findings

01 Combining data augmentation with VTLN reduced the average WER across diverse speaker groups by 6.9%.

02 The same combination reduced bias across diverse speaker groups by 3.9%.

03 A VTLN model trained on Dutch also improved recognition of Mandarin Chinese child speech.
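
The findings report both WER and bias, where bias is the performance gap between a diverse speaker group and norm speakers. The sketch below shows one way such numbers could be computed. It is a minimal illustration under stated assumptions: this page does not give the paper's exact bias formula, so the absolute WER difference used here is an assumption, and the function names `word_error_rate` and `bias_against_norm` are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)


def bias_against_norm(group_wer: dict, norm_group: str) -> dict:
    """Assumed bias measure: each group's WER minus the norm group's WER."""
    base = group_wer[norm_group]
    return {g: w - base for g, w in group_wer.items() if g != norm_group}
```

For example, `bias_against_norm({"norm_adults": 10.0, "children": 25.0}, "norm_adults")` yields a gap of 15.0 WER points for the hypothetical `children` group. These group names and values are invented for illustration and are not results from the paper.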

Abstract

Speech technology has improved greatly for norm speakers, i.e., adult native speakers of a language without speech impediments or strong accents. However, non-norm or diverse speaker groups show a distinct performance gap with norm speakers, which we refer to as bias. In this work, we aim to reduce bias against different age groups and non-native speakers of Dutch. For an end-to-end (E2E) ASR system, we use state-of-the-art speed perturbation and spectral augmentation as data augmentation techniques and explore Vocal Tract Length Normalization (VTLN) to normalise for spectral differences due to differences in anatomy. The combination of data augmentation and VTLN reduced the average WER and bias across various diverse speaker groups by 6.9% and 3.9%, respectively. The VTLN model trained on Dutch was also effective in improving performance of Mandarin Chinese child speech, thus, showing…
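
To make the three techniques named in the abstract concrete, here is a minimal NumPy sketch of speed perturbation, spectrogram masking (the masking part of spectral augmentation), and a piecewise-linear VTLN frequency warp. It is an illustration under assumptions, not the authors' implementation: the page does not state the perturbation factors, mask sizes, or warp function used, so the parameters and the names `speed_perturb`, `spec_mask`, and `vtln_warp` are hypothetical, and the warp shown is only one common form.

```python
import numpy as np


def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample by linear interpolation so the signal plays `factor` times faster."""
    n_out = int(round(len(wave) / factor))
    src_pos = np.arange(n_out) * factor  # sample positions in the original signal
    return np.interp(src_pos, np.arange(len(wave)), wave)


def spec_mask(spec: np.ndarray, max_f: int, max_t: int,
              rng: np.random.Generator) -> np.ndarray:
    """Zero one random frequency band and one random time span (spec is freq x time)."""
    out = spec.copy()
    n_freq, n_time = out.shape
    f = int(rng.integers(0, max_f + 1))        # band width, 0..max_f
    f0 = int(rng.integers(0, n_freq - f + 1))  # band start
    out[f0:f0 + f, :] = 0.0
    t = int(rng.integers(0, max_t + 1))        # span length, 0..max_t
    t0 = int(rng.integers(0, n_time - t + 1))  # span start
    out[:, t0:t0 + t] = 0.0
    return out


def vtln_warp(freq: np.ndarray, alpha: float, f_nyquist: float,
              f_cut_ratio: float = 0.8) -> np.ndarray:
    """Scale frequencies by alpha below a breakpoint, then join linearly to Nyquist."""
    f_cut = f_cut_ratio * f_nyquist * min(1.0, 1.0 / alpha)
    # Slope of the upper segment, chosen so that f_nyquist maps to itself
    slope_hi = (f_nyquist - alpha * f_cut) / (f_nyquist - f_cut)
    return np.where(freq <= f_cut,
                    alpha * freq,
                    alpha * f_cut + slope_hi * (freq - f_cut))
```

In a typical setup the warp factor `alpha` is estimated per speaker and applied to the filterbank centre frequencies before feature extraction, while speed perturbation and masking are applied only to training data. Whether the paper follows exactly this arrangement is not stated on this page.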

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques

Methods: Speed perturbation · Spectral augmentation · Vocal Tract Length Normalization (VTLN)