Golden Gemini is All You Need: Finding the Sweet Spots for Speaker Verification

Tianchi Liu; Kong Aik Lee; Qiongqiong Wang; Haizhou Li

arXiv:2312.03620·eess.AS·April 25, 2024·1 cites

TL;DR

This paper systematically explores stride configurations in ResNet models for speaker verification, identifying optimal settings called Golden Gemini that significantly improve performance and efficiency across multiple datasets.

Contribution

It introduces the Golden Gemini principle for designing 2D ResNet models tailored to speaker verification, leading to state-of-the-art results and a new benchmark.

Findings

01. Golden Gemini points improve EER and minDCF significantly
02. State-of-the-art ResNet baseline gains performance and efficiency
03. Golden Gemini is effective across various architectures and training conditions

Abstract

Previous studies demonstrate the impressive performance of residual neural networks (ResNet) in speaker verification. The ResNet models treat the time and frequency dimensions equally. They follow the default stride configuration designed for image recognition, where the horizontal and vertical axes exhibit similarities. This approach ignores the fact that time and frequency are asymmetric in speech representation. In this paper, we address this issue and look for optimal stride configurations specifically tailored for speaker verification. We represent the stride space on a trellis diagram, and conduct a systematic study on the impact of temporal and frequency resolutions on the performance and further identify two optimal points, namely Golden Gemini, which serves as a guiding principle for designing 2D ResNet-based speaker verification models. By following the principle, a…
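To make the time/frequency asymmetry concrete, the following minimal sketch traces how per-stage strides shrink a spectrogram's frequency and time axes. It is an illustration only, not the paper's code: the function name `output_resolution`, the 80-bin by 200-frame input, and both stride lists are assumptions chosen for the example, and the asymmetric list is hypothetical rather than the Golden Gemini configuration reported by the authors.

```python
def output_resolution(n_freq, n_frames, stage_strides):
    """Return (freq_bins, frames) after each stage's (freq_stride, time_stride).

    Assumes every stage downsamples with a 3x3 convolution with padding 1,
    for which an axis of length n and stride s yields ceil(n / s) outputs.
    """
    for f_stride, t_stride in stage_strides:
        n_freq = -(-n_freq // f_stride)      # ceiling division
        n_frames = -(-n_frames // t_stride)  # ceiling division
    return n_freq, n_frames


# Image-style default (assumed): both axes are halved in the same stages.
symmetric = [(1, 1), (2, 2), (2, 2), (2, 2)]

# Hypothetical asymmetric alternative: frequency is still reduced 8x,
# but time is reduced only 2x, keeping more temporal resolution.
asymmetric = [(1, 1), (2, 2), (2, 1), (2, 1)]

print(output_resolution(80, 200, symmetric))   # (10, 25)
print(output_resolution(80, 200, asymmetric))  # (10, 100)
```

Both configurations end with the same frequency resolution, yet the second keeps four times as many frames. Treating the two axes separately in this way is what opens up the space of stride configurations that the paper explores.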

Code & Models

Repositories

wenet-e2e/wespeaker (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Residual Connection · Convolution · Average Pooling · 1x1 Convolution · Global Average Pooling · Residual Block · Batch Normalization · Max Pooling · Kaiming Initialization