Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition
Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng

TL;DR
This paper introduces spectro-temporal deep features derived from singular value decomposition (SVD) of speech spectra to improve disordered speech recognition and assessment, and reports WER reductions on the UASpeech corpus.
Contribution
It proposes novel spectro-temporal subspace basis embedding deep features for disordered speech recognition and speaker adaptation, outperforming traditional i-Vector methods.
Findings
Achieved up to 8.6% relative WER reduction over the baseline
Consistent improvements with data augmentation and LHUC adaptation
Final system attained 25.6% WER on the UASpeech test set
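The first finding is a relative figure, not an absolute one. A minimal sketch of how a relative WER reduction is computed; the function name and the input numbers are hypothetical illustrations, not results from the paper:

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the arithmetic:
# a drop from 40.0% to 36.0% WER is 4.0 points absolute, 10% relative.
example = relative_wer_reduction(40.0, 36.0)
```

A relative figure always needs the baseline WER alongside it to be interpreted as an absolute gain.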
Abstract
Automatic recognition of disordered speech remains a highly challenging task to date. Sources of variability commonly found in normal speech including accent, age or gender, when further compounded with the underlying causes of speech impairment and varying severity levels, create large diversity among speakers. To this end, speaker adaptation techniques play a vital role in current speech recognition systems. Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed to facilitate both accurate speech intelligibility assessment and auxiliary feature based speaker adaptation of state-of-the-art hybrid…
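The abstract describes features derived by SVD of the speech spectrum. A minimal sketch of that decomposition step, assuming a spectrogram stored as a frequency-by-time matrix; the function name, the number of retained bases, and the synthetic input are all assumptions for illustration, and the paper's actual feature extraction and embedding network are not reproduced here:

```python
import numpy as np


def spectro_temporal_bases(spectrogram: np.ndarray, num_bases: int = 2):
    """Split a (num_freq_bins, num_frames) spectrogram into subspace bases.

    Returns the leading left singular vectors (spectral bases), the matching
    singular values, and the leading right singular vectors (temporal bases).
    """
    # Thin SVD: spectrogram = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_bases = U[:, :num_bases]       # shape (num_freq_bins, num_bases)
    temporal_bases = Vt[:num_bases, :].T    # shape (num_frames, num_bases)
    return spectral_bases, s[:num_bases], temporal_bases


# Synthetic stand-in for a real spectrogram (assumption: 40 bins, 120 frames).
rng = np.random.default_rng(0)
spec = rng.standard_normal((40, 120))
U_d, s_d, V_d = spectro_temporal_bases(spec, num_bases=2)
```

Here the columns of `U_d` span the dominant spectral subspace and the columns of `V_d` the dominant temporal subspace. How many bases to keep, and how they are turned into deep features, is a design choice this sketch leaves open.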
