U3-xi: Pushing the Boundaries of Speaker Recognition by Incorporating Uncertainty

Junjie Li; Kong Aik Lee

arXiv:2601.15719 · cs.SD · March 25, 2026

Open Access

TL;DR

This paper introduces U3-xi, a novel framework that enhances speaker recognition by estimating and utilizing frame-level uncertainty to improve the reliability and interpretability of speaker embeddings, leading to significant performance gains.

Contribution

U3-xi is the first framework to incorporate comprehensive uncertainty estimation into speaker recognition, combining multiple supervision strategies and a Transformer-based module for improved accuracy.

Findings

01 Achieves 21.1% relative EER reduction on VoxCeleb1

02 Attains 15.57% relative minDCF improvement

03 Demonstrates model-agnostic applicability and robustness

Abstract

An utterance-level speaker embedding is typically obtained by aggregating a sequence of frame-level representations. However, in real-world scenarios, individual frames encode not only speaker-relevant information but also various nuisance factors. As a result, different frames contribute unequally to the final utterance-level speaker representation for Automatic Speaker Verification systems. To address this issue, we propose to estimate the inherent uncertainty of each frame and assign adaptive weights accordingly, where frames with higher uncertainty receive lower attention. Based on this idea, we present U3-xi, a comprehensive framework designed to produce more reliable and interpretable uncertainty estimates for speaker embeddings. Specifically, we introduce several strategies for uncertainty supervision. First, we propose speaker-level uncertainty supervision via a Stochastic…
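The core idea in the abstract, down-weighting frames with higher estimated uncertainty before aggregation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `uncertainty_weighted_pool`, the use of a per-frame log-variance as the uncertainty measure, and the softmax over negative log-variance are all assumptions chosen for illustration, since the excerpt does not specify how U3-xi computes its weights.

```python
import numpy as np


def uncertainty_weighted_pool(frames, log_var):
    """Aggregate frame-level features into one utterance-level embedding.

    frames:  array of shape (T, D), one D-dimensional feature per frame.
    log_var: array of shape (T,), a hypothetical per-frame log-variance
             (higher value = more uncertain frame). This input is an
             assumption; the paper's uncertainty estimator is not shown here.
    Returns (embedding of shape (D,), weights of shape (T,)).
    """
    frames = np.asarray(frames, dtype=float)
    log_var = np.asarray(log_var, dtype=float)

    # Higher uncertainty should yield a lower score, so negate the log-variance.
    scores = -log_var
    # Subtract the maximum before exponentiating for numerical stability.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()  # softmax: weights sum to 1

    # Weighted average of the frames.
    embedding = weights @ frames
    return embedding, weights


if __name__ == "__main__":
    # Two toy frames: the second is marked as far more uncertain.
    toy_frames = [[1.0, 0.0], [0.0, 1.0]]
    toy_log_var = [0.0, 10.0]
    emb, w = uncertainty_weighted_pool(toy_frames, toy_log_var)
    print("weights:", w)
    print("embedding:", emb)
```

In this toy setup the confident first frame dominates the pooled embedding, which mirrors the stated behaviour that frames with higher uncertainty receive lower attention. A uniform `log_var` reduces the sketch to plain mean pooling.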

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Topic Modeling