From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Tianduo Wang; Lu Xu; Wei Lu; Shanbo Cheng

arXiv:2505.16972·cs.CL·May 23, 2025

TL;DR

This paper presents Speech Back-Translation, a scalable pipeline that converts large text corpora into synthetic speech using TTS models trained on only tens of hours of real transcribed speech, and uses that synthetic speech to improve multilingual ASR models.

Contribution

It introduces a pipeline that uses limited transcribed speech to produce high-quality synthetic speech, enabling massive data augmentation for multilingual speech recognition.

Findings

01 Generated more than 500,000 hours of synthetic speech in ten languages.

02 Reduced transcription errors by over 30% on average.

03 Demonstrated that synthetic data for ASR training is both scalable and effective.

Abstract

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten…
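
The abstract mentions an intelligibility-based assessment with thresholds but this excerpt does not say how it is computed. As a rough illustration only, one common proxy for intelligibility is the word error rate (WER) between the text fed to the TTS model and an ASR transcript of the resulting audio. The sketch below assumes that proxy; the function names, the `max_wer` value of 0.2, and the filtering rule are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def keep_intelligible(pairs, max_wer=0.2):
    """Keep (input_text, asr_transcript) pairs whose WER is at most max_wer.

    max_wer is an illustrative, hypothetical threshold, not a value from the paper.
    """
    return [(text, transcript) for text, transcript in pairs
            if word_error_rate(text, transcript) <= max_wer]


if __name__ == "__main__":
    samples = [
        ("the cat sat on the mat", "the cat sat on the mat"),
        ("the cat sat on the mat", "a bat on mat"),
    ]
    for text, transcript in keep_intelligible(samples):
        print(text, "|", transcript)
```

Under this assumed setup, a synthetic utterance is kept only when an ASR system can recover most of the text it was generated from; the threshold trades data volume against data quality.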

Code & Models

Repositories

tianduowang/speech-bt (PyTorch, official)

Videos

From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition · underline

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques