WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning

Rajath Rao; Adithya Ganesan; Oscar Kjell; Jonah Luby; Akshay Raghavan; Scott Feltman; Whitney Ringwald; Ryan L. Boyd; Benjamin Luft; Camilo Ruggero; Neville Ryant; Roman Kotov; H. Andrew Schwartz

arXiv:2501.16344 · eess.AS · June 3, 2025

TL;DR

WhiSPA enhances speech encoding by aligning it with semantic and psychological embeddings using self-supervised contrastive learning, reducing reliance on external language models and improving performance on affective and psychological tasks.

Contribution

This work introduces WhiSPA, a novel speech encoder that integrates semantic and psychological alignment through contrastive learning with a teacher model, eliminating the need for a separate text-based language model.

Findings

01 Achieves error reductions of 73.4% on self-supervised affective tasks and 83.8% on downstream psychological tasks.

02 Surpasses current speech encoders in semantic and psychological alignment.

03 Demonstrates the effectiveness of self-supervised contrastive learning for speech representation.

Abstract

Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text-LM is unnecessary. We introduce WhiSPA (Whisper with Semantic and Psychological Alignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper's latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Over self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders,…
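To make the training objective named in the abstract more concrete, here is a minimal sketch of a contrastive loss in which a frozen text embedding acts as the teacher for an audio embedding. It assumes a generic InfoNCE-style formulation with cosine similarity and a temperature; the function names, the temperature value, and the toy vectors are illustrative assumptions, and the exact loss used in the paper may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def teacher_contrastive_loss(student, teacher, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    student[i] is the audio embedding of segment i; teacher[i] is the
    frozen text embedding of the same segment. The matching teacher
    vector is the positive; the other teacher vectors in the batch
    serve as negatives.
    """
    n = len(student)
    total = 0.0
    for i in range(n):
        logits = [cosine(student[i], teacher[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the matching teacher vector.
        total += log_denom - logits[i]
    return total / n


# Toy 2-D embeddings (hypothetical values, for illustration only).
teacher = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = teacher_contrastive_loss(aligned, teacher)
loss_shuffled = teacher_contrastive_loss(shuffled, teacher)
```

In this toy setup the loss is lower when each audio embedding points toward its own teacher embedding than when the pairing is shuffled, which is the behavior an alignment objective of this kind is meant to reward.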

Code & Models

Repositories

humanlab/whispa (PyTorch, official)

Taxonomy

Topics: Ethics in Business and Education · Legal Education and Practice Innovations

Methods: Sentence-BERT