Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect

Jaya Narain; Vasudha Kowtha; Colin Lea; Lauren Tooley; Dianna Yee; Vikramjit Mitra; Zifang Huang; Miquel Espi Marques; Jon Huang; Carlos Avendano; Shirley Ren

arXiv:2505.21809·cs.SD·May 29, 2025

Open Access

TL;DR

This paper develops interpretable voice quality models for atypical speech and affect. The models perform strongly and generalize across languages and tasks, which makes them useful for speaking style analysis.

Contribution

Introduces voice quality dimension probes trained on the Speech Accessibility Project (SAP) dataset (11,184 samples from 434 speakers) and shows that they are effective and interpretable for atypical speech and affective states.

Findings

01

Probes showed strong performance in modeling the seven voice quality dimensions.

02

Models generalized well across different speech categories and languages.

03

Zero-shot evaluation confirmed robustness on unseen datasets.

Abstract

Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. Here we develop and evaluate voice quality models for seven voice and speech dimensions (intelligibility, imprecise consonants, harsh voice, naturalness, monoloudness, monopitch, and breathiness). Probes were trained on the public Speech Accessibility Project (SAP) dataset with 11,184 samples from 434 speakers, using embeddings from frozen pre-trained models as features. We found that our probes had both strong performance and strong generalization across speech elicitation categories in the SAP dataset. We further validated zero-shot performance on additional datasets, encompassing unseen languages and tasks: Italian atypical speech, English atypical speech, and affective speech. The strong zero-shot performance and the interpretability of results across an array of…
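
The probing setup described in the abstract (lightweight models trained on embeddings from a frozen pre-trained encoder) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the embeddings are random synthetic vectors standing in for frozen-model features, the labels are treated as continuous scores, the probe is a closed-form ridge regression, and the split is speaker-disjoint. The helper names `speaker_disjoint_split`, `fit_ridge_probe`, and `predict` are hypothetical.

```python
import numpy as np

# The seven dimensions named in the abstract.
DIMENSIONS = [
    "intelligibility", "imprecise_consonants", "harsh_voice", "naturalness",
    "monoloudness", "monopitch", "breathiness",
]


def speaker_disjoint_split(speaker_ids, test_fraction=0.2, seed=0):
    """Return boolean (train, test) masks with no speaker in both sets."""
    rng = np.random.default_rng(seed)
    speakers = np.unique(speaker_ids)
    rng.shuffle(speakers)
    n_test = max(1, int(round(test_fraction * len(speakers))))
    test_mask = np.isin(speaker_ids, speakers[:n_test])
    return ~test_mask, test_mask


def fit_ridge_probe(X, Y, l2=1.0):
    """Closed-form ridge regression: one linear probe per column of Y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    reg = l2 * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ Y)


def predict(W, X):
    """Apply the fitted probes to new embeddings."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ W


# Synthetic stand-in data (assumption): in practice X would hold embeddings
# extracted once from a frozen pre-trained speech model.
rng = np.random.default_rng(0)
n_samples, emb_dim, n_speakers = 600, 32, 40
X = rng.normal(size=(n_samples, emb_dim))
speaker_ids = rng.integers(0, n_speakers, size=n_samples)
true_W = rng.normal(size=(emb_dim, len(DIMENSIONS)))
Y = X @ true_W + 0.1 * rng.normal(size=(n_samples, len(DIMENSIONS)))

train_mask, test_mask = speaker_disjoint_split(speaker_ids)
W = fit_ridge_probe(X[train_mask], Y[train_mask])
pred = predict(W, X[test_mask])

# Per-dimension Pearson correlation on held-out speakers.
correlations = {
    name: float(np.corrcoef(pred[:, i], Y[test_mask][:, i])[0, 1])
    for i, name in enumerate(DIMENSIONS)
}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")
```

Because the encoder stays frozen, only the small probe weights are fitted, which is what keeps each output tied to a single named dimension. Under the same assumptions, a zero-shot check would amount to calling `predict(W, X_new)` on embeddings from an unseen dataset without refitting `W`.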

Taxonomy

Topics: Voice and Speech Disorders · Phonetics and Phonology Research · Speech Recognition and Synthesis