SpeakerRPL v2: Robust Open-set Speaker Identification through Enhanced Few-shot Foundation Tuning and Model Fusion
Zhiyong Chen, Shuhang Wu, Yingjie Duan, Xinkang Xu, Xinhui Hu

TL;DR
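This paper introduces SpeakerRPL v2, an improved open-set speaker identification method built on pretrained speaker foundation models. It improves robustness and generalization through an enhanced open-set training objective (reciprocal points learning combined with LogitNorm and adaptive anchor learning), a model fusion strategy, and a model selection method, and is validated on the VoxCeleb, ESD, and 3D-Speaker datasets.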
Contribution
It presents a novel integration of reciprocal points learning with LogitNorm and adaptive anchor learning, along with a model fusion strategy and a model selection method, for more robust open-set speaker identification.
Findings
Reduces EER from 1.28% to 0.09% on a Vox1-O-like test set
Demonstrates robustness across VoxCeleb, ESD, and 3D-Speaker datasets
Improves stability and generalization in few-shot tuning
Abstract
This paper proposes an improved approach for open-set speaker identification based on pretrained speaker foundation models. Building upon the previous Speaker Reciprocal Points Learning framework (V1), we first introduce an enhanced open-set learning objective by integrating reciprocal points learning with logit normalization (LogitNorm) and incorporating adaptive anchor learning to better constrain target speaker representations and improve robustness. Second, we propose a model fusion strategy to stabilize and enhance the few-shot tuning process, effectively reducing result randomness and improving generalization. Furthermore, we introduce a model selection method to ensure optimal performance in model fusion. Experimental evaluations on the VoxCeleb, ESD and 3D-Speaker datasets demonstrate the effectiveness and robustness of the proposed method under diverse conditions. On a newly…
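As a rough illustration of the logit normalization (LogitNorm) idea named in the abstract, the sketch below computes a cross-entropy loss on logits that have first been divided by their L2 norm and a temperature. It shows only the general LogitNorm technique, not the paper's full objective: the reciprocal points and adaptive anchor terms are omitted, and the function name, the `temperature` value, and `eps` are assumptions chosen for illustration.

```python
import math


def logitnorm_cross_entropy(logits, target, temperature=0.05, eps=1e-7):
    """Cross-entropy on L2-normalized logits (generic LogitNorm-style sketch).

    logits: list of floats, one raw score per enrolled speaker.
    target: index of the true speaker.
    temperature, eps: illustrative values, not taken from the paper.
    """
    # Divide the logit vector by its L2 norm so its magnitude cannot grow freely.
    norm = math.sqrt(sum(z * z for z in logits)) + eps
    scaled = [z / (norm * temperature) for z in logits]

    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))

    # Negative log-probability of the target class.
    return log_sum_exp - scaled[target]


if __name__ == "__main__":
    # Scaling the raw logits by 10 leaves the loss essentially unchanged.
    print(logitnorm_cross_entropy([2.0, 1.0, 0.0], target=0))
    print(logitnorm_cross_entropy([20.0, 10.0, 0.0], target=0))
```

Because the loss depends only on the direction of the logit vector, the model cannot lower it simply by inflating logit magnitudes, which is the usual motivation for pairing this kind of normalization with open-set rejection scores.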
