Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations

Wei Xia; John H.L. Hansen

arXiv:2009.00768·eess.AS·September 10, 2020

TL;DR

This paper introduces a novel global context guided transformation approach for speaker representation learning, significantly improving speaker verification accuracy by modeling long-range dependencies with minimal additional computational cost.

Contribution

The study proposes a new global context guided channel and time-frequency transformation module that enhances speaker representations in CNN models, outperforming existing methods like Squeeze&Excitation.

Findings

01

Equal Error Rate reduced from 4.56% to 3.07%.

02

Achieved 32.68% relative EER reduction.

03

DCF score improved by 27.28% relative to the baseline.
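The relative EER figure in finding 02 follows directly from the two absolute values in finding 01. A minimal sketch of that arithmetic is shown here; the helper name `relative_reduction` is hypothetical and used only for illustration. The underlying DCF values are not given on this page, so only the EER case is computed.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction from baseline to improved as a percentage of baseline."""
    return (baseline - improved) / baseline * 100.0


# EER values (in percent) taken from finding 01.
eer_reduction = relative_reduction(4.56, 3.07)
print(f"Relative EER reduction: {eer_reduction:.2f}%")  # prints 32.68%
```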

Abstract

In this study, we propose the global context guided channel and time-frequency transformations to model the long-range, non-local time-frequency dependencies and channel variances in speaker representations. We use the global context information to enhance important channels and recalibrate salient time-frequency locations by computing the similarity between the global context and local features. The proposed modules, together with a popular ResNet-based model, are evaluated on the VoxCeleb1 dataset, which is a large-scale speaker verification corpus collected in the wild. This lightweight block can be easily incorporated into a CNN model with little additional computational cost and effectively improves the speaker verification performance compared to the baseline ResNet-LDE model and the Squeeze&Excitation block by a large margin. Detailed ablation studies are also performed to…
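To make the idea in the abstract more concrete, the following is a rough NumPy sketch of how a global context vector might gate channels and time-frequency locations. It is not the authors' implementation. The specific choices here are all assumptions made for illustration: average pooling as the global context, a two-layer bottleneck with weights `w1` and `w2`, a scaled dot product as the similarity measure, and sigmoid gating.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def global_context(x):
    """Average a (C, F, T) feature map over frequency and time, giving shape (C,)."""
    return x.mean(axis=(1, 2))


def channel_transform(x, w1, w2):
    """Rescale each channel by a gate derived from the global context.

    w1 (C//r, C) and w2 (C, C//r) are hypothetical bottleneck weights.
    """
    g = global_context(x)
    hidden = np.maximum(w1 @ g, 0.0)   # ReLU bottleneck
    gate = sigmoid(w2 @ hidden)        # one gate per channel, shape (C,)
    return x * gate[:, None, None]


def time_frequency_transform(x):
    """Rescale each time-frequency location by its similarity to the global context.

    Similarity is assumed to be a scaled dot product between the context
    vector and the local C-dimensional feature at each (f, t) position.
    """
    g = global_context(x)
    sim = np.einsum("c,cft->ft", g, x) / np.sqrt(x.shape[0])
    gate = sigmoid(sim)                # one gate per location, shape (F, T)
    return x * gate[None, :, :]


# Toy input: 8 channels, 5 frequency bins, 7 time frames.
rng = np.random.default_rng(0)
C, F, T, r = 8, 5, 7, 4
x = rng.standard_normal((C, F, T))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

y = time_frequency_transform(channel_transform(x, w1, w2))
print(y.shape)
```

Because both gates only rescale the input, the output keeps the input's shape, which is consistent with the abstract's description of a block that can be dropped into an existing CNN.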

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Average Pooling · 1x1 Convolution · Global Average Pooling · Kaiming Initialization · Batch Normalization · Residual Connection · Max Pooling · Residual Block · Bottleneck Residual Block