Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
Samuele Cornell, Jordan Darefsky, Zhiyao Duan, Shinji Watanabe

TL;DR
This paper introduces a novel pipeline combining large language models and multi-speaker TTS to generate synthetic conversational speech data, improving multi-speaker ASR performance without extensive manual data collection.
Contribution
It presents a new synthetic data generation method for multi-speaker conversational ASR using LLMs and TTS, reducing manual effort and domain mismatch issues.
Findings
Fine-tuning on the generated synthetic data improves ASR accuracy over classical multi-speaker data generation methods.
Fine-tuning Whisper on the generated data outperforms fine-tuning on data built from external datasets.
The method is effective for both telephone and distant conversational speech.
Abstract
Currently, a common approach in many speech processing tasks is to leverage large scale pre-trained models by fine-tuning them on in-domain data for a particular application. Yet obtaining even a small amount of such data can be problematic, especially for sensitive domains and conversational speech scenarios, due to both privacy issues and annotation costs. To address this, synthetic data generation using single speaker datasets has been employed. Yet, for multi-speaker cases, such an approach often requires extensive manual effort and is prone to domain mismatches. In this work, we propose a synthetic data generation pipeline for multi-speaker conversational ASR, leveraging a large language model (LLM) for content creation and a conversational multi-speaker text-to-speech (TTS) model for speech synthesis. We conduct evaluation by fine-tuning the Whisper ASR model for telephone and…
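The two-stage pipeline described in the abstract (an LLM writes the conversation, a multi-speaker TTS model renders it) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `parse_dialogue`, `generate_examples`, `write_dialogue`, and `synthesize` are hypothetical, the "A: text" turn format is an assumption, and the LLM and TTS models are passed in as plain callables because the summary does not say which ones the paper uses. Joining the turn texts into a single transcript is also an assumption about how the fine-tuning targets might be formed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# One dialogue turn: (speaker label, spoken text).
Turn = Tuple[str, str]


@dataclass
class SyntheticExample:
    """One synthetic training pair: waveform samples plus reference transcript."""
    audio: List[float]
    transcript: str


def parse_dialogue(raw: str) -> List[Turn]:
    """Parse 'A: hello' style lines into turns; lines without a colon are skipped.

    The 'speaker: text' layout is an assumed LLM output format.
    """
    turns: List[Turn] = []
    for line in raw.splitlines():
        speaker, sep, text = line.partition(":")
        if sep and speaker.strip() and text.strip():
            turns.append((speaker.strip(), text.strip()))
    return turns


def generate_examples(
    prompts: Sequence[str],
    write_dialogue: Callable[[str], str],           # hypothetical LLM wrapper
    synthesize: Callable[[List[Turn]], List[float]],  # hypothetical TTS wrapper
) -> List[SyntheticExample]:
    """Turn each prompt into one (audio, transcript) pair for ASR fine-tuning."""
    examples: List[SyntheticExample] = []
    for prompt in prompts:
        turns = parse_dialogue(write_dialogue(prompt))
        if not turns:
            continue  # the LLM returned nothing usable for this prompt
        audio = synthesize(turns)
        # Assumption: the target transcript is the turn texts in spoken order.
        transcript = " ".join(text for _, text in turns)
        examples.append(SyntheticExample(audio=audio, transcript=transcript))
    return examples
```

In practice, `write_dialogue` would wrap a call to whichever LLM is available and `synthesize` would wrap a conversational multi-speaker TTS model; the resulting pairs could then be fed to a standard Whisper fine-tuning recipe.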
Taxonomy
Topics: Speech Recognition and Synthesis
