Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders
Hok-Shing Lau, Mark Huntly, Nathon Morgan, Adesua Iyenoma, Biao Zeng, Tim Bashford

TL;DR
This paper investigates how pretrained speech models, specifically Audio Spectrogram Transformers, make predictions for voice disorder detection by analyzing their attention mechanisms, revealing that fine-tuning concentrates attention on relevant phoneme regions.
Contribution
It introduces the use of attention rollout to interpret pretrained speech models in voice disorder detection and compares different configurations of Audio Spectrogram Transformers.
Findings
Attention becomes more focused on phoneme regions after fine-tuning.
Model relevance maps reveal how models make predictions under different conditions.
Fine-tuning reduces the spread of attention across the spectrogram.
Abstract
Speech contains information that is clinically relevant to some diseases and therefore has the potential to be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially pretrained large speech models, to Automatic Speech Assessment. One question that has not been explored is how these models derive their outputs from their inputs. In this work, we train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, that is, the computed relevance of the spectrogram regions when the model makes predictions. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is fine-tuned, and the model attention is concentrated on specific phoneme…
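The attention rollout step mentioned in the abstract can be sketched as follows. This is a minimal illustration of the general rollout idea (average the heads, add the identity to account for residual connections, renormalise, and multiply across layers), not the authors' implementation. The function names `attention_rollout` and `relevance_map`, the single classification token at index 0, the 0.5/0.5 residual weighting, and the random attention matrices are all assumptions made for the example.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention into one token-to-token relevance matrix.

    attentions: list of arrays, one per layer, each of shape
    (heads, tokens, tokens) with rows summing to 1.
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)               # average over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)      # account for the residual path
        a = a / a.sum(axis=-1, keepdims=True)     # keep rows normalised
        rollout = a @ rollout                     # propagate through this layer
    return rollout

def relevance_map(rollout, grid_shape, n_special=1):
    """Relevance of each spectrogram patch to the classification token.

    Assumes the classification token is at index 0 and is followed by
    n_special - 1 other special tokens, then the patches in row-major order.
    """
    cls_to_patches = rollout[0, n_special:]
    return cls_to_patches.reshape(grid_shape)

# Toy example: 3 layers, 4 heads, 1 special token + a 2x3 grid of patches.
rng = np.random.default_rng(0)
n_layers, n_heads, grid = 3, 4, (2, 3)
n_tokens = 1 + grid[0] * grid[1]
logits = rng.normal(size=(n_layers, n_heads, n_tokens, n_tokens))
weights = np.exp(logits)
attentions = list(weights / weights.sum(axis=-1, keepdims=True))  # row-wise softmax

rollout = attention_rollout(attentions)
relevance = relevance_map(rollout, grid)
print(relevance.shape)
```

With a real model, the per-layer attention arrays would come from the transformer's attention outputs rather than from random numbers, and the number of special tokens depends on the checkpoint; both should be checked against the model actually used.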
Taxonomy
Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Cleft Lip and Palate Research
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dense Connections
