Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in wav2vec 2.0
Natalie Engert, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

TL;DR
This study investigates which parts of wav2vec 2.0's learned representations are most informative for predicting five dysarthric speech descriptors, and finds that the best aggregation strategy differs by descriptor.
Contribution
The paper systematically compares layer-wise and time-wise aggregation of wav2vec 2.0 features, using attentive statistics pooling, and identifies which representation is most informative for each dysarthric speech descriptor.
Findings
Intelligibility is best captured by layer-wise representations.
Imprecise consonants, harsh voice, and monoloudness benefit from time-wise modeling.
Inappropriate silences show no clear advantage for either approach.
Abstract
Wav2vec 2.0 (W2V2) has shown strong performance in pathological speech analysis by effectively capturing the characteristics of atypical speech. Despite its success, it remains unclear which components of its learned representations are most informative for specific downstream tasks. In this study, we address this question by investigating the regression of dysarthric speech descriptors using annotations from the Speech Accessibility Project dataset. We focus on five descriptors, each addressing a different aspect of speech or voice production: intelligibility, imprecise consonants, inappropriate silences, harsh voice and monoloudness. Speech representations are derived from a W2V2-based feature extractor, and we systematically compare layer-wise and time-wise aggregation strategies using attentive statistics pooling. Our results show that intelligibility is best captured through…
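The layer-wise versus time-wise comparison described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the summary does not say how the non-pooled axis is handled, so collapsing it with a plain mean is an assumption, as are the function names, the random weights standing in for learned parameters, and the placeholder shapes (13 layers, 50 frames, 8 feature dimensions). The pooling function follows the usual form of attentive statistics pooling, that is, an attention-weighted mean and standard deviation concatenated into one vector.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stats_pooling(h, W, b, v):
    """Pool a sequence h of shape (N, D) into a (2*D,) vector.

    Attention scores come from a small tanh projection; the output
    concatenates the attention-weighted mean and standard deviation.
    """
    scores = np.tanh(h @ W + b) @ v           # (N,)
    alpha = softmax(scores, axis=0)           # (N,), sums to 1
    mu = (alpha[:, None] * h).sum(axis=0)     # weighted mean, (D,)
    var = (alpha[:, None] * h ** 2).sum(axis=0) - mu ** 2
    sigma = np.sqrt(np.clip(var, 1e-8, None)) # weighted std, (D,)
    return np.concatenate([mu, sigma])

def layer_wise(feats, params):
    # feats: (layers, time, dim). Assumption: average over time first,
    # then let attention weigh the layers.
    per_layer = feats.mean(axis=1)            # (layers, dim)
    return attentive_stats_pooling(per_layer, *params)

def time_wise(feats, params):
    # Assumption: average over layers first, then let attention weigh
    # the time frames.
    per_frame = feats.mean(axis=0)            # (time, dim)
    return attentive_stats_pooling(per_frame, *params)

# Placeholder shapes and random stand-ins for learned parameters.
rng = np.random.default_rng(0)
n_layers, n_frames, dim, att_dim = 13, 50, 8, 4
feats = rng.standard_normal((n_layers, n_frames, dim))
params = (
    rng.standard_normal((dim, att_dim)),  # W
    np.zeros(att_dim),                    # b
    rng.standard_normal(att_dim),         # v
)

layer_vec = layer_wise(feats, params)  # shape (2 * dim,)
time_vec = time_wise(feats, params)    # shape (2 * dim,)
```

In a regression setup like the one the abstract describes, either pooled vector would feed a small prediction head for a single descriptor score. The only difference between the two variants is which axis the attention weights range over.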
