Neural Scoring: A Refreshed End-to-End Approach for Speaker Recognition in Complex Conditions
Wan Lin, Junhui Chen, Tianhao Wang, Zhenyu Zhou, Lantian Li, Dong Wang

TL;DR
This paper introduces Neural Scoring, an end-to-end speaker verification framework that directly estimates verification probabilities, improving robustness in complex multi-talker scenarios and significantly reducing error rates.
Contribution
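The paper presents Neural Scoring, an end-to-end approach that scores verification trials without relying on test-side speaker embeddings, which makes it more robust in complex conditions such as multi-talker speech. It also introduces large-scale trial e2e training (LtE2E), which pairs each test utterance with a set of enrolled speakers so that many verification trials are processed per batch.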
Findings
Neural Scoring outperforms baseline methods across various conditions.
Achieves a 70.36% reduction in equal error rate (EER) on the VoxCeleb dataset.
Effective in multi-talker speech scenarios.
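For readers unfamiliar with the metric, the following is a minimal sketch of how an equal error rate and a relative EER reduction can be computed from trial scores. The function names and the toy scores are hypothetical and are not taken from the paper; the threshold sweep is a simple approximation, not the authors' evaluation code.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: the error rate at the threshold where the
    false-acceptance and false-rejection rates are closest."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline_eer, new_eer):
    """Relative reduction, e.g. 0.5 means the EER was halved."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores, invented for illustration only.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_nontargets = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_nontargets)
```

A "reduction in EER" reported as a percentage is normally this relative quantity, so halving a baseline EER would be reported as a 50% reduction.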
Abstract
Modern speaker verification systems primarily rely on speaker embeddings, followed by verification based on cosine similarity between the embedding vectors of the enrollment and test utterances. While effective, these methods struggle with multi-talker speech due to the unidentifiability of embedding vectors. In this paper, we propose Neural Scoring (NS), a refreshed end-to-end framework that directly estimates verification posterior probabilities without relying on test-side embeddings, making it more robust to complex conditions, e.g., with multiple talkers. To make the training of such an end-to-end model more efficient, we introduce a large-scale trial e2e training (LtE2E) strategy, where each test utterance pairs with a set of enrolled speakers, thus enabling the processing of large-scale verification trials per batch. Experiments on the VoxCeleb dataset demonstrate that NS…
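The two ideas in the abstract can be illustrated with a small sketch: the conventional cosine-similarity baseline, and the pairing of each test utterance with a set of enrolled speakers to form many trials per batch. This is an illustration under stated assumptions, not the authors' implementation; the function names, speaker ids, and the choice of 0/1 targets are hypothetical.

```python
import math


def cosine_score(enroll_vec, test_vec):
    """Baseline scoring: cosine similarity between an enrollment
    embedding and a test embedding."""
    dot = sum(a * b for a, b in zip(enroll_vec, test_vec))
    norm = math.sqrt(sum(a * a for a in enroll_vec)) * math.sqrt(
        sum(b * b for b in test_vec)
    )
    return dot / norm


def build_trial_targets(test_speakers, enrolled_ids):
    """Pair every test utterance with every enrolled speaker.

    test_speakers holds one set of speaker ids per test utterance; a
    multi-talker utterance contains more than one id (an assumption made
    for this sketch). Returns a B x N matrix of 0/1 trial targets.
    """
    return [
        [1 if spk in present else 0 for spk in enrolled_ids]
        for present in test_speakers
    ]


# Hypothetical batch: 2 test utterances against 3 enrolled speakers.
enrolled_ids = ["spk_a", "spk_b", "spk_c"]
test_speakers = [{"spk_a"}, {"spk_b", "spk_c"}]
targets = build_trial_targets(test_speakers, enrolled_ids)
num_trials = sum(len(row) for row in targets)
```

With B test utterances and N enrolled speakers, one batch yields B × N trials. A single test-side embedding for the second utterance would have to represent two speakers at once, which is the identifiability problem the abstract points to.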
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
