Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features
Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena Baralis

TL;DR
This paper introduces a novel method for explaining speech classification models by analyzing word-level audio segments and paralinguistic features, making model decisions more interpretable for users.
Contribution
It presents a new input perturbation approach for generating interpretable explanations at both word and paralinguistic levels in speech models.
Findings
Explanations are faithful to the model's inner workings.
Explanations are plausible and understandable to humans.
Method validated on English and Italian speech tasks.
Abstract
Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language understanding (SLU) tasks, and explanations are difficult to interpret for most users. We introduce a new approach to explain speech classification models. We generate easy-to-interpret explanations via input perturbation on two information levels. 1) Word-level explanations reveal how each word-related audio segment impacts the outcome. 2) Paralinguistic features (e.g., prosody and background noise) answer the counterfactual: "What would the model prediction be if we edited the audio signal in this way?" We validate our approach by explaining two state-of-the-art SLU models on two speech classification tasks in English and Italian. Our findings…
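The word-level idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes word boundaries are already known as sample indices, it perturbs a word by silencing its segment (the abstract does not say which perturbation is used), and it scores each word as the drop in the predicted probability of the target class. The names `word_level_importance`, `toy_model`, and the toy audio are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Audio = List[float]
WordSpan = Tuple[str, int, int]  # (word, start_sample, end_sample), end exclusive


def word_level_importance(
    audio: Audio,
    spans: Sequence[WordSpan],
    predict_proba: Callable[[Audio], Sequence[float]],
    target_class: int,
) -> List[Tuple[str, float]]:
    """Score each word by how much silencing its segment lowers the
    predicted probability of target_class (assumed perturbation: zeroing)."""
    base = predict_proba(audio)[target_class]
    scores = []
    for word, start, end in spans:
        # Replace the word's samples with silence, keep the rest untouched.
        perturbed = audio[:start] + [0.0] * (end - start) + audio[end:]
        scores.append((word, base - predict_proba(perturbed)[target_class]))
    return scores


def toy_model(audio: Audio) -> List[float]:
    """Hypothetical stand-in classifier: class 1 probability is the mean
    absolute amplitude of samples 4..7, clipped to [0, 1]."""
    p = min(1.0, sum(abs(x) for x in audio[4:8]) / 4)
    return [1.0 - p, p]


# Toy utterance: three "words" of four samples each; only the middle one is loud.
audio = [0.1] * 4 + [0.9] * 4 + [0.1] * 4
spans = [("turn", 0, 4), ("on", 4, 8), ("lights", 8, 12)]
scores = word_level_importance(audio, spans, toy_model, target_class=1)
```

In this toy setup, only the word "on" receives a non-zero score, because the stand-in model looks at that segment alone. With a real SLU model, `predict_proba` would wrap the model's forward pass and the spans would come from a word-level alignment of the transcript to the audio.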
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
