What and When to Learn: CURriculum Ranking Loss for Large-Scale Speaker Verification

Massa Baali; Sarthak Bisht; Rita Singh; Bhiksha Raj

arXiv:2603.24432 · cs.SD · March 26, 2026

Open Access

TL;DR

This paper introduces Curry, an adaptive curriculum-based loss for large-scale speaker verification that ranks training samples by difficulty online and reduces EER by 86.8% on VoxCeleb1-O and 60.0% on SITW over a Sub-center ArcFace baseline.

Contribution

The paper proposes Curry, a novel online sample difficulty estimation method using Sub-center ArcFace, enabling adaptive learning without auxiliary annotations for large-scale speaker verification.

Findings

01

Curry reduces EER by 86.8% over the Sub-center ArcFace baseline on VoxCeleb1-O.

02

Curry reduces EER by 60.0% over the Sub-center ArcFace baseline on SITW.

03

To the authors' knowledge, this is the largest-scale speaker verification system trained to date.
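
The percentages in findings 01 and 02 read most naturally as relative reductions in equal error rate. This page does not state that explicitly, and it gives no absolute EER values, so the following is only a reading aid under that assumption. The symbols are introduced here for illustration and are not taken from the paper.

```latex
\[
\Delta_{\mathrm{rel}}
  = \frac{\mathrm{EER}_{\mathrm{base}} - \mathrm{EER}_{\mathrm{Curry}}}{\mathrm{EER}_{\mathrm{base}}}
\quad\Longrightarrow\quad
\mathrm{EER}_{\mathrm{Curry}} = (1 - \Delta_{\mathrm{rel}})\,\mathrm{EER}_{\mathrm{base}}
\]
\[
\Delta_{\mathrm{rel}} = 0.868 \;\Rightarrow\; \mathrm{EER}_{\mathrm{Curry}} = 0.132\,\mathrm{EER}_{\mathrm{base}},
\qquad
\Delta_{\mathrm{rel}} = 0.600 \;\Rightarrow\; \mathrm{EER}_{\mathrm{Curry}} = 0.400\,\mathrm{EER}_{\mathrm{base}}
\]
```

Under this reading, the reported EER would be roughly 7.6 times lower than the baseline on VoxCeleb1-O and 2.5 times lower on SITW.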

Abstract

Speaker verification at large scale remains an open challenge as fixed-margin losses treat all samples equally regardless of quality. We hypothesize that mislabeled or degraded samples introduce noisy gradients that disrupt compact speaker manifolds. We propose Curry (CURriculum Ranking), an adaptive loss that estimates sample difficulty online via Sub-center ArcFace: confidence scores from dominant sub-center cosine similarity rank samples into easy, medium, and hard tiers using running batch statistics, without auxiliary annotations. Learnable weights guide the model from stable identity foundations through manifold refinement to boundary sharpening. To our knowledge, this is the largest-scale speaker verification system trained to date. Evaluated on VoxCeleb1-O and SITW, Curry reduces EER by 86.8% and 60.0% over the Sub-center ArcFace baseline, establishing a new paradigm for…
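
The tiering step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. The abstract specifies only that confidence comes from the dominant sub-center cosine similarity and that tiers are set from running batch statistics. The class name `TierRanker`, the exponential moving average, and the mean ± `width` · std thresholds are all assumptions made for the example, and the learnable tier weights are left out.

```python
import numpy as np


class TierRanker:
    """Hypothetical helper (not from the paper): bins samples by confidence."""

    def __init__(self, momentum=0.9, width=0.5):
        self.momentum = momentum  # assumed EMA factor for the running statistics
        self.width = width        # assumed band half-width, in standard deviations
        self.mean = None
        self.std = None

    @staticmethod
    def confidence(embeddings, subcenters, labels):
        """Max cosine similarity to the sub-centers of each sample's own class.

        embeddings: (B, D), subcenters: (C, K, D), labels: (B,) integer class ids.
        """
        e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        w = subcenters / np.linalg.norm(subcenters, axis=2, keepdims=True)
        cos = np.einsum("bd,bkd->bk", e, w[labels])  # (B, K) similarities
        return cos.max(axis=1)  # dominant sub-center similarity

    def update(self, conf):
        """Fold one batch of confidences into the running mean and std."""
        m, s = float(conf.mean()), float(conf.std())
        if self.mean is None:
            self.mean, self.std = m, s
        else:
            a = self.momentum
            self.mean = a * self.mean + (1 - a) * m
            self.std = a * self.std + (1 - a) * s

    def tiers(self, conf):
        """Return 0 (easy), 1 (medium) or 2 (hard) per sample."""
        hi = self.mean + self.width * self.std  # assumed easy threshold
        lo = self.mean - self.width * self.std  # assumed hard threshold
        return np.where(conf >= hi, 0, np.where(conf >= lo, 1, 2))
```

In a training loop one would call `confidence` on each batch, pass the result to `update`, and then use `tiers` to decide how each sample is weighted. How Curry actually maps tiers to loss weights is not specified on this page.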

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Imbalanced Data Classification Techniques