Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

Minsu Kim; Jeongsoo Choi; Dahun Kim; Yong Man Ro

arXiv:2308.01831 · cs.CL · August 20, 2024 · 5 cites

TL;DR

This paper introduces a textless unit-to-unit training approach for many-to-many multilingual speech-to-speech translation. It treats discrete speech units as pseudo-text, and the trained model can be adapted to text-to-speech synthesis (TTS) and text-to-speech translation (T2ST) with minimal fine-tuning.

Contribution

It presents a novel speech unit-based training method for multilingual translation that bridges speech and text modalities without relying on textual data during training.

Findings

1. Effective multilingual speech-to-speech translation is demonstrated.

2. The model can be adapted to text-to-speech synthesis (TTS) and text-to-speech translation (T2ST) with minimal fine-tuning.

3. The approach is validated across diverse languages and tasks.

Abstract

This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, namely text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units, the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both the speech and text modalities at the level of phonetic information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source…
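To make the idea of "speech units" concrete, the following is a minimal sketch of how continuous feature frames could be discretized into unit IDs. It assumes nearest-centroid quantization against a fixed codebook and a step that collapses consecutive repeated units; the abstract states neither detail, so both are illustrative assumptions rather than the authors' pipeline. The names `quantize` and `collapse_repeats`, the toy codebook, and the toy features are all hypothetical.

```python
from itertools import groupby


def quantize(features, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: squared Euclidean distance against a fixed codebook,
    standing in for whatever discretization the paper actually uses.
    """
    units = []
    for frame in features:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def collapse_repeats(units):
    """Merge runs of identical consecutive units into one (assumed step)."""
    return [unit for unit, _ in groupby(units)]


# Toy 2-D "speech features" and a 3-entry codebook (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.05, 0.0], [0.9, 1.1], [0.1, 0.9], [0.0, 1.0]]

units = quantize(features, centroids)
reduced = collapse_repeats(units)
print(units)    # [0, 0, 1, 2, 2]
print(reduced)  # [0, 1, 2]
```

The resulting integer sequence is what a unit-to-unit model would read and write in place of text tokens, which is why the abstract can describe the units as pseudo-text.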

Code & Models

Repositories

choijeongsoo/utut (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: Focus