Textless Speech-to-Speech Translation on Real Data

Ann Lee; Hongyu Gong; Paul-Ambroise Duquenne; Holger Schwenk; Peng-Jen Chen; Changhan Wang; Sravya Popuri; Yossi Adi; Juan Pino; Jiatao Gu; Wei-Ning Hsu

arXiv:2112.08352·cs.CL·May 6, 2022·1 cites

TL;DR

This paper introduces a textless speech-to-speech translation (S2ST) system that is trained on real-world speech data without any text. It uses self-supervised, unit-based speech normalization to reduce accent-related variation in multi-speaker target speech.

Contribution

The paper presents the first textless S2ST system trained on real data. It models multi-speaker target speech with a self-supervised normalization technique that needs only 10 minutes of paired data.

Findings

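01 Average gain of 3.2 BLEU on the VoxPopuli S2ST dataset with only 10 minutes of paired normalization data, compared to a baseline trained on un-normalized target speech

02 Additional 2.0 BLEU gain from incorporating automatically mined S2ST data

03 First demonstration of a textless S2ST system trained on real data for multiple languages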

Abstract

We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data…
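The unit-based target that the abstract describes can be illustrated with a toy sketch. It is not the authors' implementation and does not perform any finetuning; it only shows the kind of discrete-unit sequence a normalizer would be trained to produce. It assumes that units are obtained by assigning each frame feature to its nearest k-means centroid and that consecutive repeated units are collapsed; neither step is spelled out in the text above. The function names, centroids, and 2-D "features" are hypothetical stand-ins for real encoder outputs.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def frames_to_units(frames, centroids):
    """Quantize each frame feature to a discrete unit id (assumed k-means style)."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def collapse_repeats(units):
    """Merge runs of identical consecutive units, e.g. [0, 0, 1, 2, 2] -> [0, 1, 2]."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical toy data: three centroids and two "speakers" saying the same content
# with different frame features and different durations per unit.
centroids = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
speaker_a = [(0.1, -0.1), (0.0, 0.2), (0.9, 1.1), (3.8, 0.1), (4.1, -0.2)]
speaker_b = [(-0.2, 0.1), (1.2, 0.8), (1.0, 1.1), (0.9, 0.9), (4.2, 0.3)]

raw_a = frames_to_units(speaker_a, centroids)
raw_b = frames_to_units(speaker_b, centroids)
units_a = collapse_repeats(raw_a)
units_b = collapse_repeats(raw_b)

print(raw_a, raw_b)      # frame-level units differ between the two speakers
print(units_a, units_b)  # collapsed sequences agree in this toy case
```

In this toy case both speakers end up with the same collapsed unit sequence, which is the property the normalization is meant to encourage on real speech: the same lexical content maps to the same units regardless of who is speaking. On real data that agreement has to be learned, which is where the paired multi-speaker and reference-speaker audio comes in.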

Code & Models

Models

🤗 speechbrain/s2st-transformer-fr-en-hubert-l6-k100-cvss
model · 8 downloads · 4 likes

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling