Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders

Hok-Shing Lau; Mark Huntly; Nathon Morgan; Adesua Iyenoma; Biao Zeng; Tim Bashford

arXiv:2407.00531·cs.SD·July 2, 2024

Open Access

TL;DR

This paper analyzes the attention of pretrained speech models, specifically Audio Spectrogram Transformers, to investigate how they make predictions for voice disorder detection. It finds that fine-tuning concentrates attention on specific phoneme regions.

Contribution

It introduces the use of attention rollout to interpret pretrained speech models in voice disorder detection and compares two configurations of the Audio Spectrogram Transformer.

Findings

01

Attention becomes more focused on phoneme regions after fine-tuning.

02

Model relevance maps reveal how models make predictions under different conditions.

03

Fine-tuning reduces the spread of attention across the spectrogram.
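
The page does not say how the spread of attention is measured. As a rough illustration only, one possible proxy is the normalized Shannon entropy of a relevance map: values near 1 mean relevance is spread evenly over the spectrogram, and values near 0 mean it is concentrated in a few regions. The function name `attention_spread` and the choice of entropy are assumptions made for this sketch, not the authors' metric.

```python
import numpy as np

def attention_spread(relevance):
    """Normalized entropy of a non-negative relevance map (hypothetical proxy).

    Returns a value in [0, 1]: 1 for a perfectly uniform map,
    0 when all relevance sits in a single cell.
    Assumes the map has more than one cell and a positive sum.
    """
    p = np.asarray(relevance, dtype=float).ravel()
    p = p / p.sum()                      # turn the map into a distribution
    nonzero = p[p > 0]                   # skip zero cells: 0 * log(0) is taken as 0
    entropy = -(nonzero * np.log(nonzero)).sum()
    return entropy / np.log(p.size)      # divide by the maximum possible entropy
```

Under this proxy, a model whose relevance maps score lower after fine-tuning would be consistent with the finding above, though the paper may quantify spread differently.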

Abstract

Speech contains information that is clinically relevant to some diseases and could therefore be used for health assessment. Recent work shows an interest in applying deep learning algorithms, especially large pretrained speech models, to Automatic Speech Assessment. One question that has not been explored is how these models arrive at their outputs from their inputs. In this work, we train and compare two configurations of the Audio Spectrogram Transformer in the context of Voice Disorder Detection and apply the attention rollout method to produce model relevance maps, which give the computed relevance of each spectrogram region when the model makes a prediction. We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is fine-tuned, and the model attention is concentrated on specific phoneme…
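
For readers unfamiliar with attention rollout, the following is a minimal NumPy sketch of the general technique as it is commonly described: average each layer's attention over heads, mix in the identity to account for residual connections, renormalize the rows, and multiply the layer matrices together. The function names, the 0.5 residual weighting, the number of special tokens, and the patch-grid layout are assumptions for illustration. The paper's exact implementation is not given on this page.

```python
import numpy as np

def attention_rollout(attentions):
    """Combine per-layer attention into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), ordered from
    the first (input-side) layer to the last. Each row is assumed to be a
    softmax distribution over tokens.
    """
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer_attn in attentions:
        a = layer_attn.mean(axis=0)                  # average over heads
        a = 0.5 * a + 0.5 * np.eye(n_tokens)         # model the residual connection
        a = a / a.sum(axis=-1, keepdims=True)        # keep rows summing to 1
        rollout = a @ rollout                        # propagate through this layer
    return rollout

def relevance_map(rollout, n_freq, n_time, n_special=1):
    """Reshape the first token's rollout row into a patch grid.

    Assumes the first token is the classification token, that the first
    n_special tokens are not spectrogram patches, and that patches are
    ordered frequency-major (hypothetical layout).
    """
    patch_relevance = rollout[0, n_special:]
    return patch_relevance.reshape(n_freq, n_time)
```

The resulting grid can be upsampled and overlaid on the input spectrogram to see which time-frequency regions contribute most to a prediction.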

Taxonomy

Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Cleft Lip and Palate Research

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Adam · Dense Connections