Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models
Harm Lameris, Shree Harsha Bokkahalli Satish, Joakim Gustafson, Éva Székely

TL;DR
This paper investigates how speech foundation models (SFMs) respond to variations in voice quality, such as creaky and breathy phonation, using new datasets and open-ended tasks to understand their sensitivity to paralinguistic features.
Contribution
It introduces a novel dataset with synthesized voice quality variations and evaluates SFM responses to these non-lexical speech features, filling a gap in current benchmarks.
Findings
SFMs show sensitivity to voice quality variations
Models' responses vary with phonation types in open-ended tasks
The dataset enables more nuanced evaluation of speech models
Abstract
Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion…
