Learning to detect dysarthria from raw speech
Juliette Millet, Neil Zeghidour

TL;DR
This paper introduces a neural network that learns feature extraction, normalization, and compression directly from raw speech to improve dysarthria detection accuracy, surpassing traditional fixed features and prior learned features.
Contribution
It presents the first approach to jointly learn feature extraction, normalization, and compression directly from raw audio for speech classification tasks.
Findings
10% absolute accuracy improvement over fixed mel-filterbank features
Outperforms OpenSmile features when feature extraction, normalization, and compression are learned jointly from raw speech
Effective joint learning of multiple preprocessing steps from raw audio
Abstract
Speech classifiers of paralinguistic traits traditionally learn from diverse hand-crafted low-level features, by selecting the relevant information for the task at hand. We explore an alternative to this selection, by learning jointly the classifier, and the feature extraction. Recent work on speech recognition has shown improved performance over speech features by learning from the waveform. We extend this approach to paralinguistic classification and propose a neural network that can learn a filterbank, a normalization factor and a compression power from the raw speech, jointly with the rest of the architecture. We apply this model to dysarthria detection from sentence-level audio recordings. Starting from a strong attention-based baseline on which mel-filterbanks outperform standard low-level descriptors, we show that learning the filters or the normalization and compression improves…
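To make the abstract's three learnable components concrete, the following is a minimal forward-pass sketch in NumPy of a front end with a filterbank, a normalization factor, and a compression power. It is an illustration under stated assumptions, not the authors' architecture: the function name `frontend`, the parameters `alpha` and `power`, the window and hop sizes, and the choice of per-utterance mean normalization are all hypothetical. In a real model these parameters would be trained by backpropagation together with the classifier.

```python
import numpy as np

def frontend(wave, filters, alpha, power, win=400, hop=160, eps=1e-6):
    """Illustrative front end (hypothetical, not the paper's exact model).

    wave:    (T,) raw waveform
    filters: (F, width) filter impulse responses (would be trainable)
    alpha:   normalization exponent, scalar or (F,) (would be trainable)
    power:   compression exponent, scalar or (F,) (would be trainable)
    Returns an array of shape (n_frames, F).
    """
    # 1) Filterbank: convolve the waveform with each filter, then square.
    sub = np.stack([np.convolve(wave, f, mode="same") for f in filters])
    energy = sub ** 2                                   # (F, T)

    # 2) Frame-level averaging of the energy in each channel.
    n_frames = 1 + (energy.shape[1] - win) // hop
    frames = np.stack([
        energy[:, i * hop: i * hop + win].mean(axis=1)
        for i in range(n_frames)
    ])                                                  # (n_frames, F)

    # 3) Normalization: divide by the per-channel utterance mean,
    #    raised to alpha (alpha = 1 gives full gain normalization).
    norm = frames / (eps + frames.mean(axis=0, keepdims=True)) ** alpha

    # 4) Compression: a power law in place of a fixed logarithm.
    return (norm + eps) ** power
```

With `alpha` close to 1 the output becomes largely insensitive to the recording gain, and `power` below 1 compresses the dynamic range much as a log would; treating both as parameters lets the task decide how much of each to apply.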
Taxonomy
Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Music and Audio Processing
