Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space

Sebastião Quintas; Isabelle Ferrané; Thomas Pellegrini

arXiv:2409.12745·cs.SD·September 20, 2024

TL;DR

This paper investigates improving synthetic speech data for speech command classification by filtering with ASR and domain adaptation in SSL latent space, demonstrating enhanced data quality and model performance.

Contribution

It introduces a simple ASR-based filtering method and explores domain adaptation using CycleGAN to improve synthetic speech data for classification tasks.
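The ASR-based filtering idea can be sketched as follows: transcribe each synthetic clip and keep it only if the transcript agrees with the command the clip was supposed to contain. This is a minimal sketch, not the authors' implementation. The exact-match criterion, the `normalize` helper, and the `transcribe` callback (a stand-in for whatever ASR system is used) are all assumptions made for illustration.

```python
import re
from typing import Callable, Iterable, List, Tuple

Sample = Tuple[str, str]  # (audio path, intended command); hypothetical layout


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except letters and spaces."""
    return re.sub(r"[^a-z ]", "", text.lower()).strip()


def asr_filter(samples: Iterable[Sample],
               transcribe: Callable[[str], str]) -> List[Sample]:
    """Keep samples whose ASR transcript matches the intended command.

    `transcribe` is a placeholder for any ASR system that maps an audio
    path to text. Exact match after normalization is an assumed criterion.
    """
    kept = []
    for path, command in samples:
        if normalize(transcribe(path)) == normalize(command):
            kept.append((path, command))
    return kept


# Usage with a fake transcriber standing in for a real ASR model.
fake_asr = {"a.wav": "Yes.", "b.wav": "know", "c.wav": " STOP "}
samples = [("a.wav", "yes"), ("b.wav", "no"), ("c.wav", "stop")]
filtered = asr_filter(samples, fake_asr.__getitem__)
```

In this toy run the clip `b.wav` is discarded because the fake transcript ("know") does not match the intended command ("no"). A looser criterion, such as an edit-distance threshold, would be a design choice with its own trade-offs.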

Findings

01. ASR-based filtering improves synthetic data quality and classification performance.

02. Self-supervised features show that synthetic and real speech remain distinguishable.

03. CycleGAN can bridge the gap between synthetic and real speech in SSL space.
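The CycleGAN objective mentioned above combines a cycle consistency loss with an adversarial loss (the least squares variant is listed among the methods for this paper). The sketch below shows the standard form of both losses on plain feature vectors. It is a generic illustration, not the paper's training code: the mapping names `g` (synthetic to real) and `f` (real to synthetic), the vector representation of SSL features, and the 0.5 scaling of the least squares terms are assumptions.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Mapping = Callable[[Vector], List[float]]


def l1(a: Vector, b: Vector) -> float:
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(synth: Sequence[Vector], real: Sequence[Vector],
                           g: Mapping, f: Mapping) -> float:
    """L1 error of synth -> real -> synth plus real -> synth -> real."""
    forward = sum(l1(f(g(x)), x) for x in synth) / len(synth)
    backward = sum(l1(g(f(y)), y) for y in real) / len(real)
    return forward + backward


def lsgan_generator_loss(scores_on_fake: Sequence[float]) -> float:
    """Generator wants the discriminator to score mapped samples as 1."""
    n = len(scores_on_fake)
    return 0.5 * sum((s - 1.0) ** 2 for s in scores_on_fake) / n


def lsgan_discriminator_loss(scores_on_real: Sequence[float],
                             scores_on_fake: Sequence[float]) -> float:
    """Discriminator wants real samples scored 1 and mapped samples 0."""
    real_term = sum((s - 1.0) ** 2 for s in scores_on_real) / len(scores_on_real)
    fake_term = sum(s ** 2 for s in scores_on_fake) / len(scores_on_fake)
    return 0.5 * (real_term + fake_term)


# Toy mappings: g shifts features by +1, f shifts them back by -1.
def shift_up(v: Vector) -> List[float]:
    return [x + 1.0 for x in v]


def shift_down(v: Vector) -> List[float]:
    return [x - 1.0 for x in v]


synth_feats = [[0.0, 2.0]]
real_feats = [[1.0, 3.0]]
perfect_cycle = cycle_consistency_loss(synth_feats, real_feats, shift_up, shift_down)
```

Because `shift_down` exactly inverts `shift_up`, the toy cycle loss is zero. In practice `g` and `f` would be neural networks trained jointly with two discriminators.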

Abstract

Synthetic speech is increasingly popular as data augmentation in fields such as automatic speech recognition and speech classification. Novel text-to-speech systems with voice cloning capabilities allow a larger number of voices to be used from short audio segments, but these systems are known to hallucinate and often produce bad data that is likely to harm the downstream task. In the present work, we conduct a set of experiments on zero-shot learning with synthetic speech data for the specific task of speech command classification. Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can substantially improve the quality of the generated data, which translates into better performance. Furthermore, despite the good quality of the generated speech data, we also…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training · Batch Normalization · Residual Connection · Tanh Activation · PatchGAN · Residual Block · Cycle Consistency Loss · GAN Least Squares Loss · Instance Normalization