From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition
Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng

TL;DR
This paper presents Speech Back-Translation, a scalable method to generate large amounts of synthetic speech from text, significantly improving multilingual ASR models with minimal real data.
Contribution
It introduces a pipeline that trains TTS models on only tens of hours of transcribed speech and uses them to convert large text corpora into high-quality synthetic speech, enabling massive data augmentation for multilingual speech recognition.
Findings
Generated over 500,000 hours of synthetic speech in ten languages.
Achieved over 30% reduction in transcription errors on average.
Demonstrated scalability and effectiveness of synthetic data in ASR training.
Abstract
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten…
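The intelligibility-based filtering idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the summary does not confirm: intelligibility is approximated here by the word error rate (WER) between the text given to the TTS model and the transcript that a separate judge ASR model produces from the synthetic audio. The names `SyntheticUtterance` and `filter_by_intelligibility` and the `max_wer` value of 0.1 are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j]: edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class SyntheticUtterance:
    text: str        # sentence that was fed to the TTS model
    transcript: str  # what a judge ASR model heard in the synthetic audio


def filter_by_intelligibility(utterances, max_wer=0.1):
    """Keep utterances whose WER is at or below max_wer (hypothetical threshold)."""
    return [u for u in utterances
            if word_error_rate(u.text, u.transcript) <= max_wer]


if __name__ == "__main__":
    batch = [
        SyntheticUtterance("the cat sat on the mat", "the cat sat on the mat"),
        SyntheticUtterance("the cat sat on the mat", "a bat sat in a hat"),
    ]
    kept = filter_by_intelligibility(batch)
    print(f"kept {len(kept)} of {len(batch)} utterances")
```

In a real pipeline the transcripts would come from running an ASR model over the generated audio, and the threshold would be tuned per language rather than fixed.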
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
