Developing vocal system impaired patient-aimed voice quality assessment approach using ASR representation-included multiple features

Shaoxiang Dang; Tetsuya Matsumoto; Yoshinori Takeuchi; Takashi Tsuboi; Yasuhiro Tanaka; Daisuke Nakatsubo; Satoshi Maesawa; Ryuta Saito; Masahisa Katsuno; and Hiroaki Kudo

arXiv:2408.12279·cs.SD·August 23, 2024

TL;DR

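This paper introduces a voice quality assessment method for patients with impaired vocal systems. It uses automatic speech recognition (ASR) and self-supervised learning representations pre-trained on normal speech, and reports high correlation (>0.8 PCC) and low error (<0.5 MSE) on the PVQD clinical dataset.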

Contribution

It presents an innovative approach combining ASR representations and multiple features to improve voice quality assessment in clinical settings with limited data.

Findings

01

High correlation (>0.8 PCC) in voice quality prediction on PVQD dataset

02

Low error (<0.5 MSE) in predicting the Grade, Breathy, and Asthenic indicators on PVQD

03

Progress in assessing voice quality of Parkinson's patients post-DBS surgery

Abstract

The potential of deep learning in clinical speech processing is immense, yet the hurdles of limited and imbalanced clinical data samples loom large. This article addresses these challenges by using automatic speech recognition and self-supervised learning representations, pre-trained on extensive datasets of normal speech, to estimate the voice quality of patients with impaired vocal systems. Experiments are conducted on the PVQD dataset, which covers various causes of vocal system damage in English, and on a Japanese dataset of patients with Parkinson's disease recorded before and after subthalamic nucleus deep brain stimulation (STN-DBS) surgery. The results on PVQD show a notable correlation (>0.8 on PCC) and a low error (<0.5 on MSE) in predicting the Grade, Breathy, and Asthenic indicators. Meanwhile, progress has been…
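The abstract reports results with two metrics, the Pearson correlation coefficient (PCC) and the mean squared error (MSE). As a minimal sketch of how such metrics are typically computed between predicted and clinician-rated scores, the following uses only the Python standard library. The function names and the rating values are hypothetical illustrations, not taken from the paper or its data.

```python
import math


def mse(pred, target):
    """Mean squared error between predicted and reference scores."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def pcc(pred, target):
    """Pearson correlation coefficient between two score sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # Covariance numerator and the two standard-deviation terms
    # (the 1/n factors cancel, so they are omitted).
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    dev_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    dev_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (dev_p * dev_t)


# Hypothetical example values (NOT from the paper): model predictions
# versus reference ratings for one voice quality indicator.
predicted = [0.5, 1.2, 2.1, 2.8]
reference = [0.0, 1.0, 2.0, 3.0]

print("MSE:", mse(predicted, reference))
print("PCC:", pcc(predicted, reference))
```

A higher PCC means the predictions track the ranking and spread of the reference ratings, while a lower MSE means the predicted values are close in absolute terms; the two can disagree, which is presumably why both are reported.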

Taxonomy

Topics: Voice and Speech Disorders · Speech Recognition and Synthesis