A Semi-Supervised Framework for Speech Confidence Detection using Whisper
Adam Wynn, Jingyun Wang

TL;DR
This paper introduces a semi-supervised framework combining deep semantic embeddings and acoustic features for speaker confidence detection, achieving superior performance with limited labeled data.
Contribution
It presents a novel hybrid semi-supervised approach with an uncertainty-aware pseudo-labeling strategy that outperforms existing self-supervised models and unimodal baselines.
Findings
Achieves a Macro-F1 score of 0.751, outperforming the self-supervised baselines WavLM, HuBERT, and Wav2Vec 2.0.
Surpasses the unimodal Whisper baseline with a 3% improvement on the minority class.
Retaining only high-confidence pseudo-labels favours data quality over quantity.
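For readers unfamiliar with the headline metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so a small minority class counts as much as the majority class. The sketch below is a generic illustration of that definition, not the paper's evaluation code; the function name `macro_f1` and the toy labels are assumptions made for the example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (not from the paper): class 1 is the minority class.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
```

Because each class contributes equally to the average, a gain on the minority class moves Macro-F1 more than it would move plain accuracy.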
Abstract
Automatic detection of speaker confidence is critical for adaptive computing but remains constrained by limited labelled data and the subjectivity of paralinguistic annotations. This paper proposes a semi-supervised hybrid framework that fuses deep semantic embeddings from the Whisper encoder with an interpretable acoustic feature vector composed of eGeMAPS descriptors and auxiliary probability estimates of vocal stress and disfluency. To mitigate reliance on scarce ground truth data, we introduce an Uncertainty-Aware Pseudo-Labelling strategy where a model generates labels for unlabelled data, retaining only high-quality samples for training. Experimental results demonstrate that the proposed approach achieves a Macro-F1 score of 0.751, outperforming self-supervised baselines, including WavLM, HuBERT, and Wav2Vec 2.0. The hybrid architecture also surpasses the unimodal Whisper…
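The pseudo-labelling step described in the abstract can be sketched roughly as follows. The abstract does not state the uncertainty criterion, so this example assumes two hypothetical filters: a minimum top-class probability and a maximum predictive entropy. The function name `select_pseudo_labels`, both thresholds, and the toy probabilities are illustrative assumptions, not the authors' implementation.

```python
import math


def select_pseudo_labels(probs, min_confidence=0.9, max_entropy=0.3):
    """Keep unlabelled samples whose predictions look reliable.

    probs: list of per-sample class-probability lists from a trained model.
    Both thresholds are assumed values, not taken from the paper.
    Returns (sample_index, pseudo_label) pairs for the retained samples.
    """
    selected = []
    for index, p in enumerate(probs):
        confidence = max(p)
        # Shannon entropy in nats; low entropy means a peaked prediction.
        entropy = -sum(q * math.log(q) for q in p if q > 0)
        if confidence >= min_confidence and entropy <= max_entropy:
            selected.append((index, p.index(confidence)))
    return selected


# Toy model outputs for three unlabelled utterances (made up for illustration).
probs = [[0.97, 0.03], [0.55, 0.45], [0.05, 0.95]]
kept = select_pseudo_labels(probs)
```

In this toy run the ambiguous second sample is discarded and the other two are kept with their predicted classes; the retained pairs would then be added to the labelled set for the next round of training.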
