Pre-trained Speech Processing Models Contain Human-Like Biases that Propagate to Speech Emotion Recognition
Isaac Slaughter, Craig Greenberg, Reva Schwartz, Aylin Caliskan

TL;DR
This paper introduces the Speech Embedding Association Test (SpEAT), a method for detecting human-like biases in pre-trained speech models, and shows that these biases can carry over into speech emotion recognition outcomes.
Contribution
The study develops the SpEAT to quantify biases in speech models and demonstrates their presence and impact on downstream emotion recognition tasks.
Findings
Most models associate positive valence (pleasantness) more strongly with some social groups than with others.
Biases in pre-trained models often propagate to emotion recognition results.
Pre-trained speech models frequently learn and reflect human-like biases.
Abstract
Previous work has established that a person's demographics and speech style affect how well speech processing models perform for them. But where does this bias come from? In this work, we present the Speech Embedding Association Test (SpEAT), a method for detecting bias in one type of model used for many speech tasks: pre-trained models. The SpEAT is inspired by word embedding association tests in natural language processing, which quantify intrinsic bias in a model's representations of different concepts, such as race or valence (something's pleasantness or unpleasantness) and capture the extent to which a model trained on large-scale socio-cultural data has learned human-like biases. Using the SpEAT, we test for six types of bias in 16 English speech models (including 4 models also trained on multilingual data), which come from the wav2vec 2.0, HuBERT, WavLM, and Whisper model…
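The abstract describes the SpEAT as modeled on word embedding association tests. As a rough illustration of that family of tests, the sketch below computes a WEAT-style effect size over toy vectors. It assumes the standard WEAT formulation (cosine similarity, difference of mean associations, divided by the sample standard deviation); the paper's exact SpEAT procedure is not given in this summary, and the function names and the 2-D vectors are hypothetical stand-ins for real speech embeddings.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attr_a, attr_b):
    """How much closer w is to attribute set A than to attribute set B."""
    return mean(cosine(w, a) for a in attr_a) - mean(cosine(w, b) for b in attr_b)


def effect_size(target_x, target_y, attr_a, attr_b):
    """WEAT-style effect size d (assumed form, not taken from the paper).

    Positive d means target set X is more associated with A (and Y with B).
    """
    s_x = [association(x, attr_a, attr_b) for x in target_x]
    s_y = [association(y, attr_a, attr_b) for y in target_y]
    # Normalize by the sample standard deviation over all target stimuli.
    return (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)


# Hypothetical toy embeddings: X leans toward A, Y leans toward B.
X = [[1.0, 0.1], [0.9, 0.2]]   # e.g. embeddings of clips from one speaker group
Y = [[0.1, 1.0], [0.2, 0.9]]   # e.g. embeddings of clips from another group
A = [[1.0, 0.0]]               # e.g. "pleasant" stimuli
B = [[0.0, 1.0]]               # e.g. "unpleasant" stimuli

d = effect_size(X, Y, A, B)
print(f"effect size d = {d:.3f}")
```

In a real test, each vector would be an embedding extracted from a pre-trained speech model for one audio stimulus, and a larger positive d would indicate a stronger association between the first target group and the first attribute set.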
Taxonomy
Topics: Speech Recognition and Synthesis
