Towards Robust Assessment of Pathological Voices via Combined Low-Level Descriptors and Foundation Model Representations

Whenty Ariyanti; Kuan-Yu Chen; Sabato Marco Siniscalchi; Hsin-Min Wang; Yu Tsao

arXiv:2505.21356 · cs.SD · December 12, 2025

Open Access

TL;DR

This paper presents VOQANet and VOQANet+, deep learning models that combine foundation model embeddings and low-level acoustic features to improve objective assessment of pathological voices, outperforming traditional methods and showing robustness in noisy conditions.

Contribution

Introduction of VOQANet+ that integrates foundation model embeddings with low-level descriptors for enhanced, robust voice quality assessment.
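The fusion idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the function names (`attention_pool`, `fuse_and_predict`), the single-vector attention scoring, the linear output head, and all dimensions are assumptions made only for illustration. The sketch pools frame-level foundation-model embeddings with softmax attention, concatenates the pooled vector with a small vector of low-level descriptors, and maps the result to a score.

```python
import numpy as np

def attention_pool(frames, w_att):
    """Softmax-attention pooling over time (hypothetical helper).

    frames: (T, D) frame-level embeddings; w_att: (D,) scoring vector.
    Returns a (D,) utterance-level vector.
    """
    scores = frames @ w_att                 # one scalar score per frame
    scores = scores - scores.max()          # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()       # attention weights sum to 1
    return weights @ frames                 # weighted average of frames

def fuse_and_predict(frames, lld, w_att, W_out, b_out):
    """Concatenate pooled embeddings with low-level descriptors, then regress.

    lld: (K,) vector, e.g. [jitter, shimmer, HNR] (assumed ordering).
    W_out: (D + K, n_scores); b_out: (n_scores,).
    """
    pooled = attention_pool(frames, w_att)
    fused = np.concatenate([pooled, lld])   # simple feature-level fusion
    return fused @ W_out + b_out

# Toy usage with random values; shapes are illustrative only.
rng = np.random.default_rng(0)
T, D, K = 50, 8, 3
frames = rng.normal(size=(T, D))
lld = np.array([0.01, 0.05, 15.0])          # made-up jitter, shimmer, HNR values
scores = fuse_and_predict(frames, lld,
                          rng.normal(size=D),
                          rng.normal(size=(D + K, 1)),
                          np.zeros(1))
```

With an all-zero scoring vector the attention weights become uniform, so the pooling reduces to a plain temporal mean, which is a convenient sanity check.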

Findings

01. VOQANet outperforms baseline models in RMSE and correlation.

02. Sentence-level inputs improve accuracy over vowel-level inputs.

03. VOQANet+ maintains performance under noisy conditions.
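The two evaluation measures named in the findings can be computed as in the sketch below. The page does not say which correlation coefficient is reported, so the use of Pearson correlation here is an assumption, and the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between reference ratings and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_r(y_true, y_pred):
    """Pearson linear correlation (assumed; the source only says 'correlation')."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    yt = yt - yt.mean()                     # centre both series
    yp = yp - yp.mean()
    return float(np.sum(yt * yp) / np.sqrt(np.sum(yt ** 2) * np.sum(yp ** 2)))
```

Lower RMSE and higher correlation both indicate closer agreement with the expert ratings.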

Abstract

Perceptual voice quality assessment plays a vital role in diagnosing and monitoring voice disorders. Traditional methods, such as the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) and the Grade, Roughness, Breathiness, Asthenia, and Strain (GRBAS) scales, rely on expert raters and are prone to inter-rater variability, emphasizing the need for objective solutions. This study introduces the Voice Quality Assessment Network (VOQANet), a deep learning framework that employs an attention mechanism and Speech Foundation Model (SFM) embeddings to extract high-level features. To further enhance performance, we propose VOQANet+, which integrates self-supervised SFM embeddings with low-level acoustic descriptors, namely jitter, shimmer, and harmonics-to-noise ratio (HNR). Unlike previous approaches that focus solely on vowel-based phonation (PVQD-A), our models are evaluated on both…
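For readers unfamiliar with the three low-level descriptors named in the abstract, the sketch below uses their common textbook definitions. The abstract does not state how the authors extract them, so these formulas, the function names, and the assumption that pitch periods, peak amplitudes, and the autocorrelation peak have already been measured are all illustrative.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods / mean period."""
    p = np.asarray(periods, dtype=float)
    return float(np.mean(np.abs(np.diff(p))) / np.mean(p))

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes / mean amplitude."""
    a = np.asarray(amplitudes, dtype=float)
    return float(np.mean(np.abs(np.diff(a))) / np.mean(a))

def hnr_db(r_max):
    """Harmonics-to-noise ratio in dB from the normalised autocorrelation
    peak r_max at the pitch lag (0 < r_max < 1)."""
    return float(10.0 * np.log10(r_max / (1.0 - r_max)))
```

A perfectly periodic signal gives zero jitter and shimmer, and an autocorrelation peak of 0.5 (equal harmonic and noise energy) gives an HNR of 0 dB.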

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need