TL;DR
This paper introduces a three-stage training framework that uses unlabeled dysarthric speech and large-scale typical speech datasets to improve the robustness of dysarthric speech quality assessment (DSQA) models, outperforming state-of-the-art (SOTA) predictors.
Contribution
A novel semi-supervised framework using pseudo-labeling and contrastive learning for robust dysarthric speech assessment across diverse datasets.
Findings
The pretrained model outperforms SOTA predictors such as SpICE.
It achieves an average Spearman rank correlation coefficient (SRCC) of 0.761 on unseen datasets.
Demonstrates robustness across multiple etiologies and languages.
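For readers unfamiliar with the metric, SRCC is the Pearson correlation computed on the ranks of the predicted and reference scores, so it rewards getting the ordering of speakers right rather than the absolute values. The sketch below is a minimal standard-library illustration of that definition; the function names `average_ranks` and `srcc` are hypothetical and are not taken from the paper's code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, reference):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rp, rr = average_ranks(predicted), average_ranks(reference)
    n = len(rp)
    mean_p, mean_r = sum(rp) / n, sum(rr) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rp, rr))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_p == 0 or var_r == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / math.sqrt(var_p * var_r)
```

An average SRCC over several test sets, as reported above, would then be the mean of `srcc(...)` evaluated separately on each dataset. Whether the paper averages in exactly this way is an assumption.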
Abstract
Dysarthric speech quality assessment (DSQA) is critical for clinical diagnostics and inclusive speech technologies. However, subjective evaluation is costly and difficult to scale, and the scarcity of labeled data limits robust objective modeling. To address this, we propose a three-stage framework that leverages unlabeled dysarthric speech and large-scale typical speech datasets to scale training. A teacher model first generates pseudo-labels for unlabeled samples, followed by weakly supervised pretraining using a label-aware contrastive learning strategy that exposes the model to diverse speakers and acoustic conditions. The pretrained model is then fine-tuned for the downstream DSQA task. Experiments on five unseen datasets spanning multiple etiologies and languages demonstrate the robustness of our approach. Our Whisper-based baseline significantly outperforms SOTA DSQA predictors…
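The abstract does not spell out the label-aware contrastive objective. As a rough illustration of the general idea only, the sketch below implements a supervised-contrastive-style loss in which two utterances count as a positive pair when their (pseudo-)labels differ by at most a margin. The function name, the `margin` and `temperature` parameters, and the positive-pair rule are all assumptions made for this example and are not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_aware_contrastive_loss(embeddings, labels, temperature=0.1, margin=0.5):
    """Hypothetical loss: pull together samples whose labels are within `margin`.

    embeddings: list of non-zero vectors (one per utterance).
    labels: list of scalar quality scores, e.g. teacher pseudo-labels.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Assumed rule: a positive is any other sample with a close label.
        positives = [j for j in others if abs(labels[i] - labels[j]) <= margin]
        if not positives:
            continue  # anchors without positives do not contribute
        logits = {j: cosine(embeddings[i], embeddings[j]) / temperature for j in others}
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits.values())
        )
        # Average negative log-probability of the positives for this anchor.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    if anchors == 0:
        raise ValueError("no anchor has a positive pair; increase margin or batch size")
    return total / anchors
```

With this toy objective, a batch whose embedding clusters agree with the label clusters yields a small loss, while the same embeddings paired with mismatched labels yield a large one. In a real pipeline the embeddings would come from the speech encoder and the loss would be written in a differentiable framework.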
