Cross-Learning Fine-Tuning Strategy for Dysarthric Speech Recognition Via CDSD database
Qing Xiao, Yingshan Peng, PeiPei Zhang

TL;DR
This paper introduces a multi-speaker fine-tuning approach for dysarthric speech recognition that improves accuracy and generalization by leveraging broader pathological features, outperforming traditional single-speaker methods.
Contribution
It proposes a novel cross-learning fine-tuning strategy that enhances dysarthric speech recognition by training on multiple speakers simultaneously, reducing data dependence and overfitting.
Findings
Up to 13.15% lower word error rate (WER) than single-speaker fine-tuning.
Multi-speaker fine-tuning improves generalization and target-speaker accuracy.
The approach mitigates speaker-specific overfitting and reduces data requirements.
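To make the headline metric concrete, here is a minimal sketch of how WER is conventionally computed: word-level edit distance divided by reference length. It also shows two ways a "lower WER" figure can be expressed. The function names are illustrative and not taken from the paper, and this summary does not say whether the 13.15% figure is an absolute or a relative reduction, so both are shown.

```python
from typing import Sequence


def wer(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference).

    Inputs are token sequences. Tokenization (words vs. characters) is an
    assumption left to the caller.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


def absolute_reduction(baseline_wer: float, new_wer: float) -> float:
    """Difference in WER (percentage points if inputs are percentages)."""
    return baseline_wer - new_wer


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER that was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, `wer("the cat sat on mat".split(), "the cat sit on".split())` counts one substitution and one deletion against five reference words, giving 0.4.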
Abstract
Dysarthric speech recognition faces challenges from severity variations and disparities relative to normal speech. Conventional approaches fine-tune ASR models pre-trained on normal speech individually for each patient to prevent feature conflicts. Counter-intuitively, experiments reveal that multi-speaker fine-tuning (simultaneously on multiple dysarthric speakers) improves recognition of individual speech patterns. This strategy enhances generalization via broader pathological feature learning, mitigates speaker-specific overfitting, reduces per-patient data dependence, and improves target-speaker accuracy, achieving up to 13.15% lower WER versus single-speaker fine-tuning.
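The difference between the two fine-tuning regimes described in the abstract comes down to which utterances enter the fine-tuning set. The following is a minimal sketch of that data-selection step only. The `Utterance` record, the `select_finetune_set` helper, and the strategy names are hypothetical and are not the authors' pipeline. The model, optimizer, and CDSD loading code are deliberately omitted because this summary does not describe them.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Utterance:
    # Hypothetical record layout; the real CDSD metadata may differ.
    speaker_id: str
    audio_path: str
    transcript: str


def select_finetune_set(
    utterances: Iterable[Utterance],
    target_speaker: str,
    strategy: str,
    held_out_paths: FrozenSet[str] = frozenset(),
) -> List[Utterance]:
    """Choose fine-tuning data for one target speaker.

    strategy="single": only the target speaker's utterances (per-patient baseline).
    strategy="multi":  utterances pooled from all dysarthric speakers.
    Utterances whose path is in held_out_paths are excluded in both cases so
    the target speaker's evaluation data never leaks into training.
    """
    if strategy not in ("single", "multi"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    selected = []
    for utt in utterances:
        if utt.audio_path in held_out_paths:
            continue
        if strategy == "single" and utt.speaker_id != target_speaker:
            continue
        selected.append(utt)
    return selected
```

In both regimes the model would be evaluated on the same held-out utterances from the target speaker, so any WER difference reflects only the change in fine-tuning data.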
