Textless Direct Speech-to-Speech Translation with Discrete Speech Representation

Xinjian Li; Ye Jia; Chung-Cheng Chiu

arXiv:2211.00115 · cs.CL · November 17, 2022 · 1 citation

TL;DR

This paper introduces Textless Translatotron, an end-to-end speech-to-speech translation model trained without textual supervision by predicting discrete speech representations as an auxiliary task. It reports near-parity with Translatotron 2 and a +18.5 BLEU gain over previous textless models on a Spanish-English corpus.

Contribution

The paper proposes a textless S2ST model that uses discrete speech representations in place of text-based targets, so translation does not depend on the languages having a written form.

Findings

01. Achieves translation quality nearly on par with Translatotron 2 in multilingual translation.

02. Outperforms previous textless models by +18.5 BLEU on a Spanish-English corpus.

03. Demonstrates that discrete speech representations are effective auxiliary targets for end-to-end translation.

Abstract

Research on speech-to-speech translation (S2ST) has progressed rapidly in recent years. Many end-to-end systems have been proposed and show advantages over conventional cascade systems, which are often composed of recognition, translation and synthesis sub-systems. However, most of the end-to-end systems still rely on intermediate textual supervision during training, which makes it infeasible to work for languages without written forms. In this work, we propose a novel model, Textless Translatotron, which is based on Translatotron 2, for training an end-to-end direct S2ST model without any textual supervision. Instead of jointly training with an auxiliary task predicting target phonemes as in Translatotron 2, the proposed model uses an auxiliary task predicting discrete speech representations which are obtained from learned or random speech quantizers. When a speech encoder pre-trained…
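The abstract names random speech quantizers as one source of the discrete targets, but this excerpt does not describe how they are built. As a minimal sketch only, assuming a random-projection design (a frozen random projection followed by a nearest-neighbour lookup in a frozen random codebook), per-frame target ids could be produced as follows. The function names, dimensions, and seed are hypothetical and are not taken from the paper.

```python
import math
import random


def _unit(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def make_random_quantizer(feature_dim, code_dim, codebook_size, seed=0):
    """Build a frozen random projection and a frozen random codebook.

    Assumption: both are sampled once from a Gaussian and never trained.
    """
    rng = random.Random(seed)
    projection = [
        [rng.gauss(0.0, 1.0) for _ in range(code_dim)]
        for _ in range(feature_dim)
    ]
    codebook = [
        _unit([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
        for _ in range(codebook_size)
    ]
    return projection, codebook


def quantize(frames, projection, codebook):
    """Map each feature frame to the index of its most similar codebook entry."""
    code_dim = len(projection[0])
    ids = []
    for frame in frames:
        # Project the frame into the codebook space and normalize it.
        projected = _unit([
            sum(value * row[j] for value, row in zip(frame, projection))
            for j in range(code_dim)
        ])
        # For unit vectors, the highest dot product is the nearest neighbour.
        scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
        ids.append(max(range(len(codebook)), key=scores.__getitem__))
    return ids


# Hypothetical usage: two 4-dimensional frames, an 8-entry codebook.
projection, codebook = make_random_quantizer(feature_dim=4, code_dim=3, codebook_size=8)
target_ids = quantize([[0.1, 0.2, 0.3, 0.4], [1.0, -1.0, 0.5, 0.0]], projection, codebook)
```

In a setup like this, neither the projection nor the codebook needs text or any quantizer training, and the auxiliary task would be to predict these ids for the target speech. A learned quantizer would replace the random codebook with one fitted to speech data.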

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling