Deepfake audio as a data augmentation technique for training automatic speech to text transcription models
Alexandre R. Ferreira, Cláudio E. C. Campelo

TL;DR
This paper proposes using deepfake audio as a data augmentation method to improve speech-to-text models, especially for less-resourced languages, validated through experiments with Indian English datasets.
Contribution
Introduces a novel deepfake audio-based data augmentation framework for training robust speech-to-text models in low-resource language scenarios.
Findings
Augmented data improved speech-to-text model performance.
Framework effective with Indian English dataset.
Validated with existing deepfake and transcription models.
Abstract
Training transcription models that produce robust results requires a large and diverse labeled dataset. Finding data with the necessary characteristics is challenging, especially for languages less popular than English. Moreover, producing such data requires significant effort and often money. One strategy to mitigate this problem is data augmentation. In this work, we propose a framework that performs data augmentation based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and a dataset of English speech recorded by Indian speakers were selected, ensuring the presence of a single accent in the dataset. Subsequently, the augmented data was used to train speech-to-text models in various scenarios.
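The augmentation idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not describe the framework's interfaces, so the `Utterance` record, the `VoiceCloner` signature, the `augment` function, and the `dummy_cloner` stand-in are all hypothetical names introduced here. The sketch assumes a voice cloner that takes a reference recording of a speaker plus a new sentence and returns synthesized audio in that speaker's voice; the synthesized utterances are then appended to the real training set.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Tuple[float, ...]  # placeholder for a waveform; a real pipeline would use arrays


@dataclass(frozen=True)
class Utterance:
    speaker_id: str
    text: str
    audio: Audio
    synthetic: bool = False  # True for deepfake (cloned) audio


# Hypothetical cloner interface (an assumption, not from the paper):
# (reference audio of the target speaker, text to speak) -> synthesized audio
VoiceCloner = Callable[[Audio, str], Audio]


def augment(
    real: Sequence[Utterance],
    new_texts: Sequence[str],
    clone_voice: VoiceCloner,
    per_speaker: int,
    seed: int = 0,
) -> List[Utterance]:
    """Return the real utterances plus cloned utterances for each speaker."""
    rng = random.Random(seed)

    # Use the first recording of each speaker as the cloning reference.
    references = {}
    for utt in real:
        references.setdefault(utt.speaker_id, utt.audio)

    synthetic = []
    for speaker_id, ref_audio in references.items():
        k = min(per_speaker, len(new_texts))
        for text in rng.sample(list(new_texts), k):
            cloned = clone_voice(ref_audio, text)
            synthetic.append(Utterance(speaker_id, text, cloned, synthetic=True))

    return list(real) + synthetic


def dummy_cloner(reference_audio: Audio, text: str) -> Audio:
    """Stand-in for a real voice cloner: returns silence, one sample per character."""
    return tuple(0.0 for _ in text)
```

In practice `dummy_cloner` would be replaced by a call into an actual voice-cloning model, and the combined list would be passed to the speech-to-text training loop. Keeping the `synthetic` flag on each record makes it straightforward to compare training on real data alone against training on real plus augmented data.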
