Textless Speech-to-Speech Translation on Real Data
Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu

TL;DR
This paper introduces a textless speech-to-speech translation (S2ST) system that is trained on real-world data without any text and uses self-supervised speech normalization to reduce speaker-dependent variation in multi-speaker target speech.
Contribution
It presents the first textless S2ST system trained on real data that models multi-speaker target speech, using a self-supervised normalization technique that needs as little as 10 minutes of paired data.
Findings
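Achieved an average 3.2 BLEU gain on the VoxPopuli S2ST dataset over a baseline trained on un-normalized target speech, using only 10 minutes of paired normalization data
Gained a further 2.0 BLEU by incorporating automatically mined S2ST data
First to demonstrate a textless S2ST system trained on real, multi-language data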
Abstract
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which finetunes a pre-trained speech encoder with paired audios from multiple speakers and a single reference speaker to reduce the variations due to accents, while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain on average 3.2 BLEU gain when training the S2ST model on the VoxPopuli S2ST dataset, compared to a baseline trained on un-normalized speech target. We also incorporate automatically mined S2ST data…
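The unit-based normalization idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes, beyond what the abstract states, that speech is represented as sequences of discrete unit IDs, that the reference speaker's units are collapsed by removing consecutive repeats to form the training target, and that the finetuned encoder's frame-level predictions are decoded CTC-style. The names `reduce_units`, `ctc_greedy_decode`, and `BLANK`, along with all unit IDs, are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = -1  # hypothetical blank symbol ID for CTC-style decoding (assumption)


def reduce_units(units: Sequence[int]) -> List[int]:
    """Collapse each run of identical consecutive units into a single unit."""
    return [unit for unit, _ in groupby(units)]


def ctc_greedy_decode(frame_ids: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse consecutive repeats, then drop blank symbols."""
    return [unit for unit, _ in groupby(frame_ids) if unit != blank]


# Made-up frame-level units for one utterance read by the reference speaker.
reference_units = [12, 12, 12, 7, 7, 40, 40, 40, 40, 7]
# Reduced sequence that would serve as the normalization target.
target = reduce_units(reference_units)

# Made-up frame-level predictions from a finetuned encoder for the same
# sentence spoken by a different speaker (different timing, same content).
predicted_frames = [12, 12, BLANK, 7, BLANK, BLANK, 40, 40, 7, 7]
normalized = ctc_greedy_decode(predicted_frames)

print(target)
print(normalized)
```

In this toy case both speakers map to the same unit sequence, which is the behavior the normalization step aims for: speaker-dependent variation is reduced while the lexical content is preserved.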
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
