HPP-Voice: A Large-Scale Evaluation of Speech Embeddings for Multi-Phenotypic Classification

David Krongauz; Hido Pinto; Sarah Kohn; Yanir Marmor; Eran Segal

arXiv:2505.16490 · eess.AS · May 27, 2025

TL;DR

This study evaluates speech embeddings on a large Hebrew voice dataset for multi-phenotype health classification. It finds that modern speech embeddings outperform MFCCs and demographics at predicting medical conditions, with performance patterns that differ by gender.

Contribution

Introduces the HPP-Voice dataset and systematically compares 14 speech embedding models for health phenotype classification, revealing their effectiveness and gender-specific differences.

Findings

01

Speech embeddings outperform MFCCs and demographics in health classification.

02

An embedding from a speaker identification model predicts objectively measured moderate to severe sleep apnea in males with an AUC of 0.64 ± 0.03.

03

Gender influences model effectiveness across different medical conditions.

Abstract

Human speech contains paralinguistic cues that reflect a speaker's physiological and neurological state, potentially enabling non-invasive detection of various medical phenotypes. We introduce the Human Phenotype Project Voice corpus (HPP-Voice): a dataset of 7,188 recordings in which Hebrew-speaking adults count for 30 seconds, with each speaker linked to up to 15 potentially voice-related phenotypes spanning respiratory, sleep, mental health, metabolic, immune, and neurological conditions. We present a systematic comparison of 14 modern speech embedding models and show that embeddings from these 30-second counting tasks outperform MFCCs and demographics for downstream health condition classification. We found that an embedding learned from a speaker identification model can predict objectively measured moderate to severe sleep apnea in males with an AUC of 0.64 $\pm$ 0.03, while…
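For readers unfamiliar with the AUC metric quoted in the abstract, the following is a minimal sketch of how such a number can be computed from classifier scores. It is not the authors' evaluation code. The helper names `roc_auc` and `mean_and_std`, the toy labels and scores, and the reading of the ± term as a standard deviation over repeated evaluations are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and population standard deviation of per-run AUCs.
    Assumption: the paper's +/- may be computed differently."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var ** 0.5


# Toy data (hypothetical): 1 = condition present, 0 = absent.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of 4 positive/negative pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so a value of 0.64 indicates a modest but above-chance signal.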

Taxonomy

Topics: Speech Recognition and Synthesis