Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition

Lester Phillip Violeta; Wen-Chin Huang; Tomoki Toda

arXiv:2203.15431·cs.SD·June 30, 2022

Open Access

TL;DR

This study evaluates self-supervised learning (SSL) pretraining frameworks for pathological speech recognition. It finds that supervised pretraining still outperforms SSL methods on such datasets, which highlights how difficult SSL is to apply in this domain.

Contribution

The paper compares SSL frameworks such as wav2vec 2.0 and WavLM against supervised pretraining for pathological speech recognition, revealing the limitations of SSL in this setting.

Findings

01

Supervised pretraining outperforms SSL by 13.9% character error rate (CER) on electrolaryngeal speech.

02

Supervised pretraining outperforms SSL by 16.8% word error rate (WER) on dysarthric speech.

03

SSL frameworks do not show the same success with pathological speech as with healthy speech.
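
The findings above are stated in CER and WER. As a minimal sketch of how these metrics are commonly computed (not the paper's evaluation code), both divide the edit distance between the reference and the hypothesis by the reference length, counted in characters for CER and in words for WER. The function names below are hypothetical, and the text normalization (whitespace splitting for WER, dropping spaces for CER) is an assumption; this page does not say how the paper normalized its transcripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion each cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the number of reference characters."""
    ref_chars = list(reference.replace(" ", ""))  # assumption: spaces are ignored
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

For example, `wer("a b c d", "a c d")` counts one deleted word out of four reference words. Because insertions count as errors, either rate can exceed 1.0 when the hypothesis is much longer than the reference.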

Abstract

We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is to first pretrain the model on a large number of healthy speech datasets and then fine-tune it on the pathological speech datasets. A newer pretraining framework, self-supervised learning (SSL), trains a network using only speech data, which relaxes the training data requirements and allows more speech data to be used in pretraining. We investigate SSL frameworks such as the wav2vec 2.0 and WavLM models using different setups and compare their performance with different supervised pretraining setups, using two types of…

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research