ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping

Ivan Yakovlev; Anton Okhotnikov

arXiv:2603.11841 · eess.AS · March 13, 2026

TL;DR

ReDimNet2 is a speaker verification architecture that adds time pooling to the 1D processing pathway of the ReDimNet dimension-reshaping framework, so the channel dimension can be scaled up without a proportional increase in compute across a family of model sizes.

Contribution

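The paper introduces ReDimNet2, which adds pooling over the time dimension within the 1D processing pathway of ReDimNet, together with a family of seven configurations (B0-B6) that improve the trade-off between computational cost and accuracy on VoxCeleb1.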

Findings

01

Achieves 0.287% EER on Vox1-O with 12.3M parameters.

02

Improves the Pareto front of computational cost versus accuracy at every scale point compared to ReDimNet.

03

Introduces seven model configurations (B0-B6), ranging from 1.1M to 12.3M parameters and 0.33 to 13 GMACs.
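
The Pareto-front finding can be made concrete with a small helper. The sketch below is not from the paper: `pareto_front` is a hypothetical function, and the (cost, error) pairs are made-up toy values rather than ReDimNet2 results. It keeps a configuration only if no configuration of equal or lower cost reaches an equal or lower error.

```python
def pareto_front(points):
    """Return the (cost, error) points that are not dominated.

    Lower is better for both cost and error. `points` is any iterable
    of (cost, error) pairs.
    """
    front = []
    best_error = float("inf")
    # Sorting by cost (then error) means every earlier point is at least as cheap.
    for cost, error in sorted(points):
        # Keep the point only if it beats every cheaper-or-equal point on error.
        if error < best_error:
            front.append((cost, error))
            best_error = error
    return front


# Toy values for illustration only; these are NOT results from the paper.
toy_points = [(1.0, 0.9), (2.0, 0.7), (2.5, 0.8), (4.0, 0.5)]
toy_front = pareto_front(toy_points)
print(toy_front)
```

Here the point (2.5, 0.8) is dropped because the cheaper point (2.0, 0.7) already has a lower error. Under this reading, one model family improves on another when its front lies below the other front at every cost.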

Abstract

We present ReDimNet2, an improved neural network architecture for extracting utterance-level speaker representations that builds upon the ReDimNet dimension-reshaping framework. The key modification in ReDimNet2 is the introduction of pooling over the time dimension within the 1D processing pathway. This operation preserves the nature of the 1D feature space, since 1D features remain a reshaped version of 2D features regardless of temporal resolution, while enabling significantly more aggressive scaling of the channel dimension without proportional compute increase. We introduce a family of seven model configurations (B0-B6) ranging from 1.1M to 12.3M parameters and 0.33 to 13 GMACs. Experimental results on VoxCeleb1 benchmarks demonstrate that ReDimNet2 improves the Pareto front of computational cost versus accuracy at every scale point compared to ReDimNet, achieving 0.287% EER on…
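
The compute argument in the abstract can be illustrated with simple arithmetic. The sketch below is not the authors' implementation. It assumes a dense 1D convolution whose multiply-accumulate count is input channels × output channels × kernel size × time steps, and it assumes the 2D-to-1D reshaping merges the channel and frequency axes. The function names and all numbers are hypothetical and chosen only for illustration.

```python
def conv1d_macs(channels_in, channels_out, kernel_size, time_steps):
    """Multiply-accumulates of a dense 1D convolution (stride 1, same-length output)."""
    return channels_in * channels_out * kernel_size * time_steps


def reshape_2d_to_1d(channels, freq_bins, time_steps):
    """Shape of 1D features when the channel and frequency axes are merged (assumed layout)."""
    return (channels * freq_bins, time_steps)


# Hypothetical sizes, not taken from the paper.
T = 200  # frames before pooling
C = 256  # channels in the 1D pathway before pooling
K = 3    # kernel size
P = 4    # time-pooling factor

baseline_macs = conv1d_macs(C, C, K, T)

# After pooling time by P = 4, the layer sees T // P frames, so the channel
# count can be doubled (sqrt(4) = 2) at the same multiply-accumulate cost.
widened_macs = conv1d_macs(2 * C, 2 * C, K, T // P)

print(baseline_macs, widened_macs)

# Pooling changes only the time axis, so the pooled 1D shape is still
# a reshape of a 2D feature map with the same channel and frequency axes.
print(reshape_2d_to_1d(16, 8, T), reshape_2d_to_1d(16, 8, T // P))
```

Under these assumptions, cost grows quadratically with channel width but only linearly with the number of frames, which is why reducing temporal resolution leaves room for a wider 1D pathway.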

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing