Textless Speech-to-Speech Translation With Limited Parallel Data

Anuj Diwan; Anirudh Srinivasan; David Harwath; Eunsol Choi

arXiv:2305.15405 · cs.CL · November 8, 2024 · 2 cites

TL;DR

This paper introduces PFB, a framework for textless speech-to-speech translation that needs only limited parallel speech data. It combines pretraining, finetuning, and unsupervised backtranslation to enable translation for low-resource language pairs.

Contribution

The paper presents PFB, an approach that trains textless S2ST models with only dozens of hours of parallel speech data, making translation feasible for low-resource language pairs without relying on text.

Findings

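01 Achieves near state-of-the-art performance, measured by ASR-BLEU, with limited parallel data

02 Evaluated on English-to-German, German-to-English, and Marathi-to-English translation across three domains (European Parliament, Common Voice, and All India Radio)

03 Combines pretraining on large-scale monolingual speech, finetuning on 20-60 hours of parallel speech, and unsupervised backtranslation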

Abstract

Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve…
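The unsupervised backtranslation stage named in the abstract can be illustrated with a minimal sketch of generic backtranslation. This is not the paper's implementation. It assumes speech is represented as lists of discrete unit IDs, and `backtranslation_pairs` and `toy_translate` are hypothetical names introduced only for illustration. The idea shown: the current model translates real monolingual utterances into the other language, and each synthetic translation is paired with the real utterance it came from, giving training pairs for the reverse direction.

```python
from typing import Callable, List, Tuple

# Assumption: an utterance is a list of discrete unit IDs.
Units = List[int]
# Hypothetical model interface: (units, src_lang, tgt_lang) -> translated units.
Translate = Callable[[Units, str, str], Units]


def backtranslation_pairs(
    translate: Translate,
    monolingual: List[Units],
    lang: str,
    other_lang: str,
) -> List[Tuple[Units, Units]]:
    """Build synthetic (input, target) pairs for the other_lang -> lang direction.

    Each real utterance in `lang` is translated into `other_lang` by the
    current model. The synthetic translation becomes the input and the real
    utterance becomes the target.
    """
    pairs = []
    for real in monolingual:
        synthetic = translate(real, lang, other_lang)
        pairs.append((synthetic, real))
    return pairs


def toy_translate(units: Units, src_lang: str, tgt_lang: str) -> Units:
    # Stand-in for a trained model: shifts unit IDs by a fixed offset
    # that depends only on the target language.
    offset = 100 if tgt_lang == "de" else -100
    return [u + offset for u in units]


english = [[1, 2, 3], [4, 5]]
pairs = backtranslation_pairs(toy_translate, english, lang="en", other_lang="de")
```

In a real system the resulting pairs would feed a training step for the German-to-English direction, and the same procedure would be repeated with the languages swapped. The targets are always real speech, so only the inputs carry the model's own translation errors.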

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence