LSTM and GPT-2 Synthetic Speech Transfer Learning for Speaker Recognition to Overcome Data Scarcity

Jordan J. Bird; Diego R. Faria; Anikó Ekárt; Cristiano Premebida; Pedro P. S. Ayrosa

arXiv:2007.00659·eess.AS·July 6, 2020·1 cites

Open Access

TL;DR

This paper explores using LSTM and GPT-2 models to generate synthetic speech features, enhancing speaker recognition performance in data-scarce scenarios through transfer learning.

Contribution

It introduces a novel approach of using synthetic MFCCs generated by LSTM and GPT-2 for transfer learning in speaker recognition tasks.
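The transfer-learning idea can be illustrated with a minimal sketch: fit a classifier on synthetic data first, then reuse its weights as the starting point when training on the scarce real data. Everything here is an assumption made for illustration. A linear softmax model stands in for the paper's neural network, the arrays are random placeholders rather than MFCCs, and the names `train`, `toy_set`, and `n_feat` are hypothetical.

```python
import numpy as np


def softmax_probs(X, W):
    """Class probabilities of a linear softmax model with a bias column."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    logits = Xb @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)


def cross_entropy(X, y, W):
    """Mean negative log-likelihood of the labels y under the model."""
    P = softmax_probs(X, W)
    return float(-np.log(P[np.arange(len(y)), y] + 1e-12).mean())


def train(X, y, n_classes, W_init=None, epochs=200, lr=0.1):
    """Full-batch gradient descent, starting from W_init when it is given."""
    if W_init is None:
        W = np.zeros((X.shape[1] + 1, n_classes))  # cold start
    else:
        W = W_init.copy()  # warm start: keep the pre-trained weights intact
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        W -= lr * Xb.T @ (softmax_probs(X, W) - Y) / len(y)
    return W


# Placeholder data only: random vectors, not MFCCs from the paper.
rng = np.random.default_rng(0)
n_feat = 8  # hypothetical feature count, chosen for the toy example


def toy_set(n, noise):
    """Two-class toy data: label 1 shifts every feature by one unit."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(scale=noise, size=(n, n_feat)) + y[:, None]
    return X, y


X_syn, y_syn = toy_set(400, noise=1.5)   # stands in for generated data
X_real, y_real = toy_set(60, noise=1.0)  # stands in for scarce real data

W_pre = train(X_syn, y_syn, n_classes=2)                 # pre-train on synthetic
W_ft = train(X_real, y_real, n_classes=2, W_init=W_pre)  # fine-tune on real
W_cold = train(X_real, y_real, n_classes=2)              # baseline, no pre-training
```

The only difference between the transfer run and the baseline is the `W_init` argument, which is the sense in which the initial weight distribution is dictated by the synthetic data. The toy numbers say nothing about the accuracies reported in the paper.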

Findings

1. Synthetic data improved speaker classification accuracy.
2. LSTM-generated data outperformed GPT-2 in some cases.
3. Pre-training with synthetic data often led to near-maximal scores.

Abstract

In speech recognition problems, data scarcity often poses an issue due to the limited willingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify the data against a large dataset of Flickr8k speakers and is then compared to a transfer learning network performing the same task but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. For all 7 subjects, the best results came from networks that had been exposed to synthetic data; the model pre-trained with LSTM-produced data achieved the…
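Character-level LSTMs and GPT-2 are text generators, so MFCC frames have to be written out as text before training and parsed back into numbers after generation. The sketch below shows one plausible round trip. It is an assumption about the general shape of such a pipeline, not the authors' code: the comma-separated line format, the value of `N_MFCC`, and the helper names `frames_to_text` and `text_to_frames` are all hypothetical.

```python
import numpy as np

N_MFCC = 26  # assumed coefficients per frame; the abstract does not state this


def frames_to_text(frames: np.ndarray, decimals: int = 4) -> str:
    """Serialise MFCC frames as one comma-separated line of numbers per frame."""
    return "\n".join(
        ",".join(f"{value:.{decimals}f}" for value in row) for row in frames
    )


def text_to_frames(text: str, n_coeffs: int = N_MFCC) -> np.ndarray:
    """Parse generated text back into frames, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split(",")
        if len(parts) != n_coeffs:
            continue  # wrong number of fields: a generator can emit partial lines
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            continue  # a field that is not a number
    return np.array(rows, dtype=float).reshape(-1, n_coeffs)


# Placeholder frames: random numbers standing in for one subject's MFCCs.
rng = np.random.default_rng(0)
real = rng.normal(size=(3, N_MFCC))

corpus = frames_to_text(real)               # text a generator would be trained on
generated = corpus + "\n1.0,2.0,oops\n"     # simulate one malformed generated line
recovered = text_to_frames(generated)       # only the well-formed frames survive
```

Filtering on field count and numeric validity matters because a text model gives no guarantee that every generated line is a complete feature vector.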

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Cosine Annealing · Discriminative Fine-Tuning · Dropout · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Attention Is All You Need · Linear Warmup With Cosine Annealing · Attention Dropout