Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations
Wei Xia, John H.L. Hansen

TL;DR
This paper introduces a global context guided transformation approach for speaker representation learning. By modeling long-range dependencies at minimal additional computational cost, it reduces the equal error rate (EER) in speaker verification from 4.56% to 3.07%.
Contribution
The study proposes a new global context guided channel and time-frequency transformation module that enhances speaker representations in CNN models, outperforming existing methods like Squeeze&Excitation.
Findings
Equal Error Rate (EER) reduced from 4.56% to 3.07%.
Achieved 32.68% relative EER reduction.
Improved DCF score by 27.28%.
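The relative EER reduction follows directly from the two EER values listed above. A minimal sketch of that arithmetic (the function name `relative_reduction` is hypothetical; the absolute DCF values are not given here, so the same check cannot be repeated for the DCF figure):

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


eer_baseline = 4.56  # EER (%) before, as listed in the findings
eer_proposed = 3.07  # EER (%) after, as listed in the findings

# (4.56 - 3.07) / 4.56 = 0.32675..., i.e. about 32.68%
print(f"{relative_reduction(eer_baseline, eer_proposed):.2f}%")  # prints 32.68%
```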
Abstract
In this study, we propose the global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use the global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, which is a large-scale speaker verification corpus collected in the wild. This lightweight block can be easily incorporated into a CNN model with little additional computational cost and effectively improves the speaker verification performance compared to the baseline ResNet-LDE model and the Squeeze&Excitation block by a large margin. Detailed ablation studies are also performed to…
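To make the two ideas in the abstract concrete, here is a rough, dependency-free sketch of what "enhancing channels" and "recalibrating time-frequency locations" from a global context could look like. It is not the authors' implementation: the abstract does not give the exact formulation, so the average-pooled context vector, the weight matrix `w`, the sigmoid gates, and the scaled dot-product similarity are all assumptions chosen for illustration. The feature map is a nested list indexed as `x[channel][frequency][time]`.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def global_context(x):
    """Average every channel over all time-frequency positions (assumed pooling)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def channel_transform(x, w):
    """Scale each channel by a gate derived from the global context.

    `w` is a hypothetical C x C weight matrix standing in for learned parameters.
    """
    g = global_context(x)
    gates = [sigmoid(sum(w[i][j] * g[j] for j in range(len(g))))
             for i in range(len(g))]
    return [[[gates[c] * v for v in row] for row in ch]
            for c, ch in enumerate(x)]


def time_frequency_transform(x):
    """Scale each (frequency, time) position by its similarity to the global context.

    Similarity is an assumed scaled dot product between the context vector and
    the local feature vector across channels, squashed by a sigmoid.
    """
    g = global_context(x)
    n_ch, n_freq, n_time = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * n_time for _ in range(n_freq)] for _ in range(n_ch)]
    for f in range(n_freq):
        for t in range(n_time):
            sim = sum(g[c] * x[c][f][t] for c in range(n_ch)) / math.sqrt(n_ch)
            gate = sigmoid(sim)
            for c in range(n_ch):
                out[c][f][t] = gate * x[c][f][t]
    return out


# Toy feature map: 2 channels, 2 frequency bins, 2 time frames.
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights
y = time_frequency_transform(channel_transform(x, identity))
```

Both transforms keep the shape of the input and only rescale it, which is why a block of this kind can sit inside an existing CNN without changing the surrounding layers. A practical version would use a tensor library and learned parameters rather than nested lists.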
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Average Pooling · 1x1 Convolution · Global Average Pooling · Kaiming Initialization · Batch Normalization · Residual Connection · Max Pooling · Residual Block · Bottleneck Residual Block
