Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification
Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li

TL;DR
This paper systematically explores stride configurations in ResNet models for speaker verification, identifying optimal settings called Golden Gemini that significantly improve performance and efficiency across multiple datasets.
Contribution
It introduces the Golden Gemini principle for designing 2D ResNet models tailored to speaker verification, leading to state-of-the-art results and a new benchmark.
Findings
Golden Gemini points improve EER and minDCF significantly
State-of-the-art ResNet baseline gains performance and efficiency
Golden Gemini is effective across various architectures and training conditions
Abstract
Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a…
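As a rough illustration of what a stride configuration means here, the sketch below computes the temporal and frequency resolution left after a stack of strided 3x3 convolution stages. It is not the paper's code: the function names, the 200-frame by 80-bin input, and both stride lists are assumptions chosen for illustration, and the asymmetric list is a hypothetical example, not one of the Golden Gemini points.

```python
def conv_out_len(n, stride, kernel=3, padding=1):
    """Output length of a 1D convolution along one axis (standard formula)."""
    return (n + 2 * padding - kernel) // stride + 1


def output_resolution(n_frames, n_freq_bins, stage_strides):
    """Apply per-stage (time_stride, freq_stride) pairs and return (T, F)."""
    t, f = n_frames, n_freq_bins
    for s_t, s_f in stage_strides:
        t = conv_out_len(t, s_t)
        f = conv_out_len(f, s_f)
    return t, f


# Image-style default: time and frequency are downsampled identically.
symmetric = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric setting: keep more temporal resolution than frequency.
asymmetric = [(1, 1), (1, 2), (2, 2), (2, 2)]

print(output_resolution(200, 80, symmetric))   # (25, 10)
print(output_resolution(200, 80, asymmetric))  # (50, 10)
```

Each point in the stride space corresponds to one such pair of total downsampling factors, so changing a single stage's stride trades temporal resolution against frequency resolution and compute.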
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Residual Connection · Convolution · Average Pooling · 1x1 Convolution · Global Average Pooling · Residual Block · Batch Normalization · Max Pooling · Kaiming Initialization
