Analysis of ABC Frontend Audio Systems for the NIST-SRE24

Sara Barahona; Anna Silnova; Ladislav Mošner; Junyi Peng; Oldřich Plchot; Johan Rohdin; Lin Zhang; Jiangyu Han; Petr Palka; Federico Landini; Lukáš Burget; Themos Stafylakis; Sandro Cumani; Dominik Boboš; Miroslav Hlaváček; Martin Kodovsky; Tomáš Pavlíček

arXiv:2505.15320·eess.AS·May 22, 2025

TL;DR

This paper analyzes various embedding extractors for speaker recognition in NIST SRE 2024, comparing architectures and training conditions to develop robust, state-of-the-art frontends for conversational telephone speech.

Contribution

The paper presents a comprehensive evaluation of ResNet, ReDimNet, and XLS-R architectures for speaker embedding extraction, trained under the fixed and open conditions.

Findings

01 VoxBlink-trained models show high robustness across languages.

02 ResNet and ReDimNet architectures perform competitively.

03 Open condition training with VoxBlink improves performance.

Abstract

We present a comprehensive analysis of the embedding extractors (frontends) developed by the ABC team for the audio track of NIST SRE 2024. We follow the two scenarios imposed by NIST: using only a provided set of telephone recordings for training (fixed condition) or adding publicly available data (open condition). Under these constraints, we develop the best possible speaker embedding extractors for the predominant conversational telephone speech (CTS) domain. We explored architectures based on ResNet with different pooling mechanisms, the recently introduced ReDimNet architecture, and a system based on the XLS-R model, which represents the family of large pre-trained self-supervised models. In the open condition, we train on the VoxBlink2 dataset, which contains 110 thousand speakers across multiple languages. We observed good performance and robustness of VoxBlink-trained models, and our…

Taxonomy

Topics: Speech and Audio Processing · Image and Signal Denoising Methods

Methods: Average Pooling · Convolution · Global Average Pooling · Kaiming Initialization · Max Pooling · Sparse Evolutionary Training · Approximate Bayesian Computation