Textless Speech-to-Speech Translation With Limited Parallel Data
Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

TL;DR
This paper introduces PFB, a framework for textless speech-to-speech translation (S2ST) that needs only limited parallel speech data. It combines pretraining on monolingual speech, finetuning on a small parallel set, and unsupervised backtranslation, targeting low-resource language pairs.
Contribution
PFB trains textless S2ST models with only dozens of hours of parallel speech data (20-60 hours), making translation feasible for low-resource and textless languages without using text as an intermediate step.
Findings
Achieves near state-of-the-art performance with limited data
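Evaluated on English-to-German, German-to-English, and Marathi-to-English across three domains (European Parliament, Common Voice, and All India Radio)
Combines pretraining on monolingual speech, finetuning on parallel speech, and unsupervised backtranslation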
Abstract
Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve…
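The backtranslation stage mentioned in the abstract can be illustrated with a minimal sketch. It shows only how one round of synthetic training pairs might be built from monolingual data. Everything here is an assumption for illustration: the `Units` representation of speech as integer sequences, the `backtranslation_pairs` helper, and the `toy_translate` stand-in are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: each utterance is represented as a sequence of integer units.
Units = Tuple[int, ...]


def backtranslation_pairs(
    translate: Callable[[Units, str], Units],
    monolingual: Sequence[Units],
    other_lang: str,
) -> List[Tuple[Units, Units]]:
    """Build synthetic (input, target) pairs for one backtranslation round.

    Each monolingual utterance is translated into `other_lang` with the
    current model. The synthetic translation becomes the training input and
    the original utterance becomes the target, so no parallel data is needed.
    """
    pairs = []
    for utterance in monolingual:
        synthetic = translate(utterance, other_lang)
        pairs.append((synthetic, utterance))
    return pairs


def toy_translate(units: Units, target_lang: str) -> Units:
    """Hypothetical stand-in for a trained model: shifts unit ids by language."""
    offset = 100 if target_lang == "de" else -100
    return tuple(u + offset for u in units)


english_only = [(1, 2, 3), (4, 5)]
pairs = backtranslation_pairs(toy_translate, english_only, "de")
for synthetic_de, original_en in pairs:
    print(synthetic_de, "->", original_en)
```

In a real system the model would then be updated on these pairs, and the same procedure would be repeated in the opposite direction with monolingual data from the other language.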
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
