Analysis of ABC Frontend Audio Systems for the NIST-SRE24
Sara Barahona, Anna Silnova, Ladislav Mošner, Junyi Peng, Oldřich Plchot, Johan Rohdin, Lin Zhang, Jiangyu Han, Petr Palka, Federico Landini, Lukáš Burget, Themos Stafylakis, Sandro Cumani, Dominik Boboš, Miroslav Hlavaček, Martin Kodovsky

TL;DR
This paper analyzes various embedding extractors for speaker recognition in NIST SRE 2024, comparing architectures and training conditions to develop robust, state-of-the-art frontends for conversational telephone speech.
Contribution
It introduces a comprehensive evaluation of ResNet, ReDimNet, and XLS-R architectures trained under fixed and open conditions for speaker embedding extraction.
Findings
VoxBlink-trained models show high robustness across languages.
ResNet and ReDimNet architectures perform competitively.
Open condition training with VoxBlink improves performance.
Abstract
We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our…
Taxonomy
Topics: Speech and Audio Processing · Image and Signal Denoising Methods
Methods: Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling · Sparse Evolutionary Training · Approximate Bayesian Computation
