Self-supervised ASR Models and Features For Dysarthric and Elderly Speech Recognition

Shujie Hu; Xurong Xie; Mengzhe Geng; Zengrui Jin; Jiajun Deng; Guinan Li; Yi Wang; Mingyu Cui; Tianzi Wang; Helen Meng; Xunying Liu

arXiv:2407.13782·eess.AS·July 22, 2024

Open Access

TL;DR

This paper investigates integrating domain fine-tuned self-supervised speech models into ASR systems to improve recognition accuracy for dysarthric and elderly speech, addressing data scarcity and mismatch issues.

Contribution

It introduces methods for combining SSL features with traditional ASR systems and demonstrates significant performance improvements across multiple dysarthric and elderly speech datasets.

Findings

01 Significant WER/CER reductions across all datasets

02 Improved Alzheimer's detection accuracy

03 Effective multi-modal ASR system development

Abstract

Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity and mismatch. To this end, this paper explores a series of approaches to integrate domain fine-tuned SSL pre-trained models and their features into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition. These include: a) input feature fusion between standard acoustic frontends and domain fine-tuned SSL speech representations; b) frame-level joint decoding between TDNN systems separately trained using standard acoustic features alone and those with additional domain fine-tuned SSL features; and c) multi-pass decoding involving the TDNN/Conformer system outputs to be rescored using domain fine-tuned pre-trained ASR…
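
Approaches (a) and (b) in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_features` and `joint_frame_scores`, the per-frame concatenation, the linear interpolation of log scores, and the `weight` parameter are all assumptions made for illustration, and the sketch assumes the two feature streams are already aligned to the same frame rate.

```python
def fuse_features(acoustic, ssl):
    """Input feature fusion (assumed form): concatenate, frame by frame,
    a standard acoustic feature vector with an SSL representation vector."""
    if len(acoustic) != len(ssl):
        raise ValueError("frame counts differ; align the two streams first")
    return [a + s for a, s in zip(acoustic, ssl)]


def joint_frame_scores(logp_a, logp_b, weight=0.5):
    """Frame-level joint decoding (assumed form): interpolate the per-frame
    log scores of two systems that share the same output units.
    `weight` is a hypothetical interpolation weight for system B."""
    if len(logp_a) != len(logp_b):
        raise ValueError("both systems must score the same frames")
    return [
        [(1.0 - weight) * x + weight * y for x, y in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(logp_a, logp_b)
    ]


# Two frames of 2-dim acoustic features and 1-dim SSL features.
fused = fuse_features([[1.0, 2.0], [3.0, 4.0]], [[0.5], [0.6]])

# One frame scored by two systems over two output units.
joint = joint_frame_scores([[-1.0, -2.0]], [[-3.0, -4.0]], weight=0.5)
```

In practice the interpolation weight would be tuned on held-out data, and the combined scores would be passed to the decoder in place of either system's scores alone.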

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: XLSR