Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection
Jianwei Zhang, Julie Liss, Suren Jayasuriya, and Visar Berisha

TL;DR
This paper introduces a deep learning framework that generates vocal quality embeddings for dysphonic voice detection, achieving high accuracy and robustness across different datasets and deteriorated conditions.
Contribution
It proposes a deep learning model trained jointly with a contrastive loss and a classification loss, combined with data warping of the input voice samples, to produce robust vocal quality feature embeddings.
Findings
High in-corpus and cross-corpus classification accuracy
Embeddings sensitive to voice quality and robust across datasets
Consistently outperforms baseline methods on various datasets
Abstract
Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different…
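The joint objective described in the abstract can be illustrated with a minimal NumPy sketch. The specific forms below are assumptions, not the authors' formulation: a margin-based pairwise contrastive term, a softmax cross-entropy term, and a hypothetical weighting parameter `weight`. The function names are illustrative only.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Softmax cross-entropy averaged over the batch (numerically stable)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def pairwise_contrastive(embeddings, labels, margin=1.0):
    """Assumed margin-based contrastive loss over all pairs in the batch.

    Same-label pairs are pulled together (squared distance); different-label
    pairs are pushed apart until they are at least `margin` away.
    """
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude self-pairs
    pos = (dist ** 2)[same & off_diag]
    neg = (np.maximum(0.0, margin - dist) ** 2)[~same]
    terms = np.concatenate([pos, neg])
    return terms.mean() if terms.size else 0.0

def joint_loss(embeddings, logits, labels, weight=0.5, margin=1.0):
    """Classification loss plus a weighted contrastive loss (weight is hypothetical)."""
    return cross_entropy(logits, labels) + weight * pairwise_contrastive(
        embeddings, labels, margin
    )

# Toy batch: two well-separated classes in a 2-D embedding space.
emb = np.array([[0.0, 0.0], [0.0, 0.0], [5.0, 5.0], [5.0, 5.0]])
lab = np.array([0, 0, 1, 1])
logits = np.zeros((4, 2))  # uninformative classifier output
loss = joint_loss(emb, logits, lab)
```

In this toy batch the contrastive term is close to zero because same-class embeddings coincide and different-class embeddings lie beyond the margin, so the total is dominated by the classification term.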
Taxonomy
Topics: Voice and Speech Disorders · Speech Recognition and Synthesis · Music and Audio Processing
