Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition

Samuele Cornell; Jordan Darefsky; Zhiyao Duan; Shinji Watanabe

arXiv:2408.09215·eess.AS·August 20, 2024

TL;DR

This paper introduces a novel pipeline combining large language models and multi-speaker TTS to generate synthetic conversational speech data, improving multi-speaker ASR performance without extensive manual data collection.

Contribution

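The paper presents a synthetic data generation method for multi-speaker conversational ASR that uses an LLM to write the dialogue content and a conversational multi-speaker TTS model to synthesize it, reducing the manual effort and domain mismatch of earlier approaches built on single-speaker datasets.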

Findings

01

Fine-tuning on the generated data improves ASR accuracy over classical synthetic data generation methods.

02

Fine-tuning Whisper with generated data outperforms using external datasets.

03

The method is effective for both telephone and distant conversational speech.

Abstract

Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and…
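
The two generation stages named in the abstract (LLM for content, multi-speaker TTS for audio) can be pictured as a small data-building loop. The following is only a minimal sketch, not the authors' implementation: `build_examples`, `stub_llm`, `stub_tts`, the speaker-tagged transcript format, and the 8 kHz sample rate are all hypothetical choices made for illustration, and the stubs stand in for a real LLM and a real TTS model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (speaker_id, text); hypothetical representation


@dataclass
class SyntheticExample:
    audio: List[float]  # mono waveform samples
    transcript: str     # speaker-tagged reference text (format is an assumption)


def build_examples(
    prompts: Sequence[str],
    write_dialogue: Callable[[str], List[Turn]],
    synthesize: Callable[[List[Turn]], List[float]],
) -> List[SyntheticExample]:
    """Pair each generated dialogue with its synthesized audio."""
    examples = []
    for prompt in prompts:
        turns = write_dialogue(prompt)   # stage 1: LLM writes the conversation
        audio = synthesize(turns)        # stage 2: multi-speaker TTS renders it
        transcript = " ".join(f"[{spk}] {text}" for spk, text in turns)
        examples.append(SyntheticExample(audio=audio, transcript=transcript))
    return examples


def stub_llm(prompt: str) -> List[Turn]:
    """Placeholder for an LLM call: returns a fixed two-speaker exchange."""
    return [("A", f"Hi, about {prompt}?"), ("B", "Sure, go ahead.")]


def stub_tts(turns: List[Turn], sample_rate: int = 8000) -> List[float]:
    """Placeholder for a TTS model: emits 0.1 s of silence per word."""
    n_words = sum(len(text.split()) for _, text in turns)
    return [0.0] * (n_words * sample_rate // 10)


examples = build_examples(["a billing question"], stub_llm, stub_tts)
print(examples[0].transcript)
print(len(examples[0].audio))
```

The resulting (audio, transcript) pairs would then serve as fine-tuning data for an ASR model such as Whisper; that training step is omitted here because the page does not describe its configuration.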

Code & Models

Repositories

popcornell/ASRLightningFT (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis