Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models

Harm Lameris; Shree Harsha Bokkahalli Satish; Joakim Gustafson; Éva Székely

arXiv:2510.25577 · eess.AS · October 30, 2025

TL;DR

This paper investigates how speech foundation models (SFMs) respond to variations in voice quality, such as creaky and breathy phonation, using a new dataset and open-ended tasks to measure their sensitivity to paralinguistic features.

Contribution

The paper introduces a dataset with synthesized voice quality variations and evaluates how SFMs respond to these non-lexical speech features, filling a gap in current benchmarks.

Findings

01. SFMs show sensitivity to voice quality variations

02. Models' responses vary with phonation types in open-ended tasks

03. The dataset enables more nuanced evaluation of speech models

Abstract

Recent advances in speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. One under-explored dimension of paralinguistic variation is voice quality, encompassing phonation types such as creaky and breathy voice. These phonation types are known to influence how listeners infer affective state, stance and social meaning in speech. Existing benchmarks for speech understanding largely rely on multiple-choice question answering (MCQA) formats, which are prone to failure and therefore unreliable in capturing the nuanced ways paralinguistic features influence model behaviour. In this paper, we probe SFMs through open-ended generation tasks and speech emotion…
