Attention-based conditioning methods using variable frame rate for style-robust speaker verification

Amber Afshan; Abeer Alwan

arXiv:2206.13680 · eess.AS · June 29, 2022

TL;DR

This paper introduces an entropy-based variable frame rate conditioning method for self-attention in speaker verification, improving robustness to speaking style variations across multiple datasets.

Contribution

The paper proposes an entropy-based conditioning vector for the self-attention layer, making speaker embeddings more robust to speaking style variations in text-independent speaker verification.

Findings

1. Significant improvements over the baseline in 12/23 tasks
2. Outperforms unconditioned self-attention in 9/23 tasks
3. Effective in multi-speaker scenarios like SITW

Abstract

We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using the bottleneck features as speaker representations. Such a network has a pooling layer to transform frame-level to utterance-level features by calculating statistics over all utterance frames, with equal weighting. However, self-attentive embeddings perform weighted pooling such that the weights correspond to the importance of the frames in a speaker classification task. Entropy can capture acoustic variability due to speaking style variations. Hence, an entropy-based variable frame rate vector is proposed as an external conditioning vector for the self-attention layer to provide the network with information that can address style effects. This work…
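The contrast the abstract draws between equal-weight pooling and self-attentive pooling can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear scoring form, and the single scalar conditioning value per frame (standing in for an entropy-based variable frame rate value) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames):
    """Equal-weight pooling: average each feature dimension over all frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def attentive_pool(frames, cond, w_frame, w_cond):
    """Weighted pooling with an external per-frame conditioning value.

    frames  : list of frame-level feature vectors (lists of floats)
    cond    : one conditioning scalar per frame (hypothetical stand-in
              for an entropy-based variable frame rate value)
    w_frame : scoring weights for the frame features (assumed linear scorer)
    w_cond  : scoring weight for the conditioning value (assumed)
    """
    # Score each frame from its own features plus its conditioning value.
    scores = [
        sum(w * x for w, x in zip(w_frame, f)) + w_cond * c
        for f, c in zip(frames, cond)
    ]
    # Normalize scores into attention weights that sum to 1.
    weights = softmax(scores)
    # Utterance-level vector: attention-weighted sum of the frames.
    dims = len(frames[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, frames)) for d in range(dims)]
    return pooled, weights


# Toy data: three 2-dimensional frames and made-up conditioning values.
example_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
example_cond = [0.1, 0.9, 0.5]

print(mean_pool(example_frames))
print(attentive_pool(example_frames, example_cond, [0.0, 0.0], 2.0))
```

With all scoring weights set to zero the attention weights are uniform and the result reduces to mean pooling; a positive `w_cond` shifts weight toward frames with a larger conditioning value. In a trained network these scoring weights would be learned jointly with the speaker classifier rather than set by hand.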

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing