U3-xi: Pushing the Boundaries of Speaker Recognition by Incorporating Uncertainty
Junjie Li, Kong Aik Lee

TL;DR
This paper introduces U3-xi, a framework that estimates frame-level uncertainty and uses it to down-weight unreliable frames when forming speaker embeddings. The result is more reliable and interpretable embeddings, with a reported 21.1% relative EER reduction on VoxCeleb1.
Contribution
U3-xi is the first framework to incorporate comprehensive uncertainty estimation into speaker recognition, combining multiple supervision strategies and a Transformer-based module for improved accuracy.
Findings
Achieves 21.1% relative EER reduction on VoxCeleb1
Attains 15.57% relative minDCF improvement
Demonstrates model-agnostic applicability and robustness
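The relative figures above follow the usual convention of comparing a metric against a baseline, where lower is better for both EER and minDCF. A minimal sketch of that arithmetic is given here; the function name and the example values are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# Illustrative values only (NOT results from the paper):
# a baseline EER of 1.000% and an improved EER of 0.789%.
print(round(relative_reduction(1.000, 0.789), 1))
```

With these made-up inputs the relative reduction comes out at roughly 21%, which shows how a small absolute change in EER can correspond to a large relative one.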
Abstract
An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic…
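The core idea of down-weighting uncertain frames can be sketched as follows. This is a minimal illustration, not the paper's method: it assumes each frame comes with a scalar log-variance as its uncertainty score and turns those scores into pooling weights with a softmax, which is equivalent to normalized precision weighting. The function name `uncertainty_weighted_pool` and the input shapes are assumptions made for this example.

```python
import numpy as np


def uncertainty_weighted_pool(frames, log_var):
    """Pool frame-level features into one utterance-level vector.

    frames:  array of shape (T, D), one D-dimensional feature per frame.
    log_var: array of shape (T,), assumed per-frame log-variance
             (a hypothetical uncertainty score, not the paper's definition).
    Returns (embedding of shape (D,), weights of shape (T,)).
    """
    frames = np.asarray(frames, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    if frames.ndim != 2 or log_var.shape != (frames.shape[0],):
        raise ValueError("expected frames (T, D) and log_var (T,)")

    # Softmax over negative log-variance: higher uncertainty gives a lower weight.
    scores = -log_var
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()

    # Weighted average of the frames.
    embedding = weights @ frames
    return embedding, weights


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
    uncert = [0.0, 0.0, 5.0]  # the third frame is treated as highly uncertain
    emb, w = uncertainty_weighted_pool(feats, uncert)
    print(w)    # the third weight is much smaller than the first two
    print(emb)  # the embedding stays close to the mean of the first two frames
```

When all frames have the same uncertainty, the weights become uniform and the result reduces to plain mean pooling, so this sketch can be read as a generalization of average pooling.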
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling
