Custom Data Augmentation for low resource ASR using Bark and Retrieval-Based Voice Conversion

Anand Kamble; Aniket Tathe; Suyash Kumbharkar; Atharva Bhandare; Anirban C. Mitra

arXiv:2311.14836·cs.SD·January 11, 2024·2 cites

Open Access · 1 dataset

TL;DR

This paper introduces two novel methods using Bark and Retrieval-Based Voice Conversion to create customized datasets for low-resource ASR, improving data quality and enabling personalized voice synthesis.

Contribution

It presents innovative methodologies leveraging Bark and RVC for constructing tailored datasets for low-resource languages like Hindi.

Findings

01. Enhanced dataset quality for low-resource ASR

02. Improved performance of ASR models with custom datasets

03. Potential for high-quality personalized voice generation

Abstract

This paper proposes two innovative methodologies to construct customized Common Voice datasets for low-resource languages like Hindi. The first methodology leverages Bark, a transformer-based text-to-audio model developed by Suno, and incorporates Meta's EnCodec and a pre-trained HuBERT model to enhance Bark's performance. The second methodology employs Retrieval-Based Voice Conversion (RVC) and uses the Ozen toolkit for data preparation. Both methodologies contribute to the advancement of ASR technology and offer valuable insights into addressing the challenges of constructing customized Common Voice datasets for under-resourced languages. Furthermore, they provide a pathway to achieving high-quality, personalized voice generation for a range of applications.
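
As a rough illustration of what "constructing a customized Common Voice dataset" could look like in practice, the sketch below pairs each sentence with a generated clip and records the pair in a tab-separated manifest. It is not the authors' pipeline. The function `synthesize_placeholder` is a hypothetical stand-in for a Bark or RVC call and only writes a short silent WAV file; the column names `client_id`, `path`, and `sentence`, the file name `train.tsv`, and the `clips/` folder are assumptions modeled loosely on the Common Voice layout (which distributes MP3 clips rather than WAV).

```python
import csv
import tempfile
import wave
from pathlib import Path


def synthesize_placeholder(text: str, out_path: Path, sample_rate: int = 16000) -> None:
    """Hypothetical stand-in for a TTS or voice-conversion call.

    Ignores `text` and writes 0.1 s of 16-bit mono silence, so the sketch
    runs without any speech model installed.
    """
    with wave.open(str(out_path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * (sample_rate // 10))


def build_manifest(sentences, out_dir: Path, speaker_id: str = "synthetic_speaker") -> Path:
    """Generate one clip per sentence and write a TSV manifest describing them."""
    clips_dir = out_dir / "clips"
    clips_dir.mkdir(parents=True, exist_ok=True)
    manifest_path = out_dir / "train.tsv"

    with manifest_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        # Assumed column names, loosely following Common Voice TSV files.
        writer.writerow(["client_id", "path", "sentence"])
        for i, sentence in enumerate(sentences):
            clip_name = f"clip_{i:05d}.wav"
            synthesize_placeholder(sentence, clips_dir / clip_name)
            writer.writerow([speaker_id, clip_name, sentence])

    return manifest_path
```

Calling `build_manifest(["नमस्ते", "आप कैसे हैं"], Path(tempfile.mkdtemp()))` would produce a `train.tsv` with a header and two rows plus two clips under `clips/`. Swapping the placeholder for a real synthesis or conversion step is where either of the paper's two methodologies would plug in.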

Code & Models

Datasets

Aniket-Tathe-08/Custom_common_voice_dataset_using_RVC
dataset · 131 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing