ReDimNet2: Scaling Speaker Verification via Time-Pooled Dimension Reshaping
Ivan Yakovlev, Anton Okhotnikov

TL;DR
ReDimNet2 is a speaker verification architecture that adds time pooling to the 1D pathway of the ReDimNet dimension-reshaping design, so the channel dimension can be scaled up without a proportional increase in compute across a range of model sizes.
Contribution
The paper introduces ReDimNet2, which extends ReDimNet with pooling over the time dimension inside the 1D processing pathway to improve the trade-off between speaker verification accuracy and computational cost.
Findings
Achieves 0.287% EER on Vox1-O with 12.3M parameters.
Improves the Pareto front of computational cost versus accuracy over ReDimNet at every scale point.
Introduces seven model configurations (B0-B6) ranging from 1.1M to 12.3M parameters and 0.33 to 13 GMACS.
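For readers unfamiliar with the metric in these findings, the following is a minimal sketch of how an equal error rate (EER) can be approximated from verification scores. It is not the paper's evaluation code: the function name `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep is a simple approximation rather than an interpolated ROC computation.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by trying every observed score as a threshold."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        # Keep the threshold where the two error rates are closest.
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical toy scores, not data from the paper.
toy_targets = [0.9, 0.8, 0.7, 0.4]
toy_impostors = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_targets, toy_impostors)
```

On these toy scores one target trial and one impostor trial are misclassified at the best threshold, so both error rates are 25%. An EER of 0.287% means the false acceptance and false rejection rates meet at roughly 3 errors per 1000 trials.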
Abstract
We present ReDimNet2, an improved neural network architecture for extracting utterance-level speaker representations that builds upon the ReDimNet dimension-reshaping framework. The key modification in ReDimNet2 is the introduction of pooling over the time dimension within the 1D processing pathway. This operation preserves the nature of the 1D feature space, since 1D features remain a reshaped version of 2D features regardless of temporal resolution, while enabling significantly more aggressive scaling of the channel dimension without proportional compute increase. We introduce a family of seven model configurations (B0-B6) ranging from 1.1M to 12.3M parameters and 0.33 to 13 GMACS. Experimental results on VoxCeleb1 benchmarks demonstrate that ReDimNet2 improves the Pareto front of computational cost versus accuracy at every scale point compared to ReDimNet, achieving 0.287% EER on…
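The two claims in the abstract about time pooling can be illustrated with a small sketch. It is not the authors' implementation: average pooling, the helper names (`reshape_2d_to_1d`, `avg_pool_time`, `conv1d_macs`), the toy tensor sizes, and the dense 1D convolution cost model are all assumptions chosen for illustration. The first part checks that pooling over time commutes with the 2D-to-1D reshape, and the second shows that a 4x shorter time axis pays for a 2x wider dense convolution at the same multiply-accumulate count.

```python
def reshape_2d_to_1d(x):
    """Flatten nested lists of shape (C, F, T) into (C*F, T)."""
    return [row for channel in x for row in channel]


def avg_pool_time(rows, stride):
    """Average non-overlapping windows of `stride` frames in each row."""
    return [
        [sum(r[i:i + stride]) / stride for i in range(0, len(r) - stride + 1, stride)]
        for r in rows
    ]


def conv1d_macs(channels, frames, kernel=3):
    """MACs of one dense 1D convolution with equal input and output channels."""
    return channels * channels * kernel * frames


# Toy 2D feature map: C=2 channels, F=3 frequency bins, T=8 frames.
C, F, T = 2, 3, 8
x = [[[float(c * 100 + f * 10 + t) for t in range(T)] for f in range(F)] for c in range(C)]

# Pooling time in the 2D view and then reshaping ...
pooled_then_reshaped = reshape_2d_to_1d([avg_pool_time(channel, 2) for channel in x])
# ... matches reshaping first and pooling time in the 1D view.
reshaped_then_pooled = avg_pool_time(reshape_2d_to_1d(x), 2)
same_feature_space = pooled_then_reshaped == reshaped_then_pooled

# Hypothetical layer sizes: 4x fewer frames, 2x more channels, equal cost.
baseline_macs = conv1d_macs(channels=256, frames=200)
pooled_macs = conv1d_macs(channels=512, frames=50)
```

Because the cost of a dense convolution grows quadratically in channels but only linearly in frames, pooling time by a factor s leaves room for roughly a sqrt(s) increase in channel width under this simplified cost model.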
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Audio Processing
