Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits
Tiantian Feng, Jihwan Lee, Anfeng Xu, Yoonjeong Lee, Thanathai Lertpetchpun, Xuan Shi, Helin Wang, Thomas Thebaud, Laureano Moro-Velazquez, Dani Byrd, Najim Dehak, Shrikanth Narayanan

TL;DR
Vox-Profile is a comprehensive benchmark for characterizing diverse speaker and speech traits using foundation models, enabling multi-dimensional profiling and supporting various speech analysis applications.
Contribution
It introduces a holistic, multi-dimensional speech trait benchmark grounded in speech science, developed with domain experts, and validated across multiple datasets and models.
Findings
Vox-Profile effectively characterizes static and dynamic speech traits.
It enhances analysis of ASR performance variability.
It evaluates speech generation systems with automated profiles.
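One way to picture the ASR-variability finding is to group recognition errors by a predicted trait. The snippet below is a minimal sketch, not the authors' analysis code: the record layout, the `wer_by_trait` helper, the trait labels, and all numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def wer_by_trait(utterances, trait):
    """Aggregate word error rate (errors / reference words) per trait value.

    `utterances` is a list of dicts in a hypothetical layout:
    {"profile": {trait_name: label, ...}, "word_errors": int, "ref_words": int}
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for utt in utterances:
        label = utt["profile"][trait]
        errors[label] += utt["word_errors"]
        words[label] += utt["ref_words"]
    # Skip groups without reference words to avoid division by zero.
    return {label: errors[label] / words[label] for label in words if words[label] > 0}


# Toy, made-up records: the profile labels stand in for automated trait predictions.
utterances = [
    {"profile": {"accent": "A", "emotion": "neutral"}, "word_errors": 2, "ref_words": 20},
    {"profile": {"accent": "A", "emotion": "angry"}, "word_errors": 6, "ref_words": 20},
    {"profile": {"accent": "B", "emotion": "neutral"}, "word_errors": 1, "ref_words": 10},
]

for trait in ("accent", "emotion"):
    print(trait, wer_by_trait(utterances, trait))
```

Pooling errors and reference words per group before dividing weights long utterances more heavily than averaging per-utterance rates would; either choice is defensible depending on the analysis.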
Abstract
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing…
Code & Models
- 🤗 tiantiaf/wavlm-large-broader-accent · model · 516 dl · ♡ 1
- 🤗 tiantiaf/wavlm-large-narrow-accent · model · 552 dl · ♡ 1
- 🤗 tiantiaf/wavlm-large-categorical-emotion · model · 606 dl · ♡ 5
- 🤗 tiantiaf/wavlm-large-speech-flow · model · 508 dl · ♡ 1
- 🤗 tiantiaf/wavlm-large-voice-quality · model · 546 dl · ♡ 3
- 🤗 tiantiaf/wavlm-large-age-sex · model · 2.1k dl · ♡ 8
- 🤗 tiantiaf/whisper-large-v3-msp-podcast-emotion · model · 1.5k dl · ♡ 5
- 🤗 tiantiaf/whisper-large-v3-narrow-accent · model · 1.3k dl · ♡ 4
- 🤗 tiantiaf/whisper-large-v3-broad-accent · model · 1.2k dl · ♡ 1
- 🤗 tiantiaf/whisper-large-v3-speech-flow · model · 1.2k dl · ♡ 1
Taxonomy
Topics: Speech Recognition and Synthesis
