Variable frame rate-based data augmentation to handle speaking-style variability for automatic speaker verification

Amber Afshan, Jinxi Guo, Soo Jin Park, Vijay Ravi, Alan McCree, and Abeer Alwan

arXiv:2008.03616·eess.AS·August 11, 2020

TL;DR

This paper introduces a variable frame rate data augmentation method to mitigate speaking-style variability in automatic speaker verification. It improves performance in style-mismatched conditions to a level comparable to multi-style PLDA adaptation, without requiring multi-style training data.

Contribution

The study proposes an entropy-based variable frame rate technique that normalizes speaking-style differences, improving speaker verification accuracy when enrollment and test utterances differ in speaking style.
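To make the idea of entropy-driven frame selection concrete, here is a minimal sketch in Python. It is not the authors' implementation: the selection rule (keep a frame once the accumulated change in spectral entropy reaches a threshold), the function names `spectral_entropy` and `select_frames`, and the `threshold` parameter are all assumptions made for illustration.

```python
import math


def spectral_entropy(power_spectrum):
    """Shannon entropy (in nats) of one frame's power spectrum,
    with the spectrum normalized to sum to 1."""
    total = sum(power_spectrum)
    if total <= 0:
        return 0.0
    probs = [p / total for p in power_spectrum if p > 0]
    return -sum(p * math.log(p) for p in probs)


def select_frames(frames, threshold):
    """Return the indices of frames kept by a simple variable frame rate rule.

    Assumed rule (illustrative only): the first frame is always kept; a later
    frame is kept once the absolute entropy change accumulated since the last
    kept frame reaches `threshold`.
    """
    if not frames:
        return []
    entropies = [spectral_entropy(f) for f in frames]
    kept, accumulated = [0], 0.0
    for i in range(1, len(frames)):
        accumulated += abs(entropies[i] - entropies[i - 1])
        if accumulated >= threshold:
            kept.append(i)
            accumulated = 0.0
    return kept


# Toy spectra: "flat" frames have high entropy, "peaked" frames have zero entropy.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [1.0, 0.0, 0.0, 0.0]
frames = [flat, flat, flat, peaked, peaked, flat]
kept_indices = select_frames(frames, threshold=1.0)
```

Under this rule, stationary stretches contribute few frames and rapidly changing stretches contribute more. Lowering `threshold` keeps more frames and raising it keeps fewer, so running the same utterance at several thresholds would yield several time-warped versions of it, which is one plausible way such a rule could serve as augmentation.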

Findings

01 Reduced EER in style-mismatched conditions

02 Improved robustness to speaking-style variability

03 Comparable performance to multi-style PLDA adaptation

Abstract

The effects of speaking-style variability on automatic speaker verification were investigated using the UCLA Speaker Variability database which comprises multiple speaking styles per speaker. An x-vector/PLDA (probabilistic linear discriminant analysis) system was trained with the SRE and Switchboard databases with standard augmentation techniques and evaluated with utterances from the UCLA database. The equal error rate (EER) was low when enrollment and test utterances were of the same style (e.g., 0.98% and 0.57% for read and conversational speech, respectively), but it increased substantially when styles were mismatched between enrollment and test utterances. For instance, when enrolled with conversation utterances, the EER increased to 3.03%, 2.96% and 22.12% when tested on read, narrative, and pet-directed speech, respectively. To reduce the effect of style mismatch, we propose an…
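The equal error rate quoted in the abstract is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below approximates it by sweeping a decision threshold over the observed scores. The function name `equal_error_rate` and the toy scores are hypothetical, and real evaluations typically interpolate between thresholds rather than picking the closest one.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    at the threshold where false acceptance and false rejection rates are
    closest, and reported as their average.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one target trial scores below one nontarget trial.
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
```

With these toy scores, the threshold 0.6 rejects one of four target trials and accepts one of four nontarget trials, so both error rates are 25%. A style mismatch between enrollment and test shifts the two score distributions toward each other, which is what raises the EER.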

Code & Models

Datasets

sdialog/voices-ucla