Investigating Self-supervised Pretraining Frameworks for Pathological Speech Recognition
Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

TL;DR
This study evaluates self-supervised learning (SSL) pretraining frameworks for pathological speech recognition and finds that supervised pretraining still outperforms SSL methods on these datasets, which highlights the difficulty of applying SSL in this domain.
Contribution
The paper compares SSL frameworks such as wav2vec 2.0 and WavLM with supervised pretraining for pathological speech recognition, revealing the limitations of SSL in this context.
Findings
Supervised pretraining outperforms SSL by 13.9% character error rate (CER) on electrolaryngeal speech.
Supervised pretraining outperforms SSL by 16.8% word error rate (WER) on dysarthric speech.
SSL frameworks do not show the same success with pathological speech as with healthy speech.
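The findings above are reported in CER and WER. As a minimal sketch of how these metrics are commonly computed (edit distance between reference and hypothesis, divided by reference length), the following may help; it is an illustration only, not the authors' evaluation code, and the function names and example sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are single characters."""
    return error_rate(list(reference), list(hypothesis))


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a six-word reference.
    print(f"WER: {wer('the cat sat on the mat', 'the cat sit on mat'):.3f}")
    # One deleted character against a five-character reference.
    print(f"CER: {cer('hello', 'helo'):.3f}")
```

CER is typically preferred for languages without clear word boundaries, and WER for languages with whitespace-separated words, which may be why the two datasets are scored with different metrics.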
Abstract
We investigate the performance of self-supervised pretraining frameworks on pathological speech datasets used for automatic speech recognition (ASR). Modern end-to-end models require thousands of hours of data to train well, but only a small number of pathological speech datasets are publicly available. A proven solution to this problem is to first pretrain the model on a large number of healthy speech datasets and then fine-tune it on the pathological speech datasets. One newer pretraining framework, self-supervised learning (SSL), trains a network using only speech data, which relaxes the training data requirements and allows more speech data to be used in pretraining. We investigate SSL frameworks such as the wav2vec 2.0 and WavLM models using different setups and compare their performance with different supervised pretraining setups, using two types of…
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research
