Speech to Speech Translation with Translatotron: A State of the Art Review

Jules R. Kala; Emmanuel Adetiba; Abdultaofeek Abayom; Oluwatobi E. Dare; Ayodele H. Ifijeh

arXiv:2502.05980·cs.CL·February 21, 2025

Open Access

TL;DR

This paper reviews the evolution of Translatotron, a direct speech-to-speech translation model, highlighting its improvements over cascade models and its potential to bridge language gaps, especially for African languages.

Contribution

It provides a comprehensive review of all Translatotron versions and evaluates their effectiveness compared to traditional cascade models.

Findings

1. Translatotron 3 outperforms cascade models in some aspects.
2. Translatotron models reduce compound errors in speech translation.
3. Translatotron is effective for bridging African languages with others.

Abstract

Cascade-based speech-to-speech translation has long been considered the benchmark, but it suffers from several problems, notably the time taken to translate speech from one language to another and compound errors. These problems arise because a cascade chains several components: speech recognition, text translation, and finally text-to-speech synthesis. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compound errors associated with the cascade model. There are currently three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results.…
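
The compound-error issue mentioned in the abstract can be illustrated with a small sketch. This is not the paper's method: the stage functions `asr`, `translate`, and `tts` are hypothetical stand-ins that only tag their input, and the per-stage accuracies are made-up numbers. The sketch also assumes that stage errors are independent, which real systems need not satisfy.

```python
import math
from typing import Callable, List

Stage = Callable[[str], str]


def cascade(stages: List[Stage]) -> Stage:
    """Chain stages so that each one consumes the previous stage's output."""
    def run(x: str) -> str:
        for stage in stages:
            x = stage(x)
        return x
    return run


def end_to_end_accuracy(stage_accuracies: List[float]) -> float:
    """Probability that every stage is correct, ASSUMING independent errors."""
    return math.prod(stage_accuracies)


# Hypothetical stand-ins for the three cascade components.
def asr(audio: str) -> str:          # speech recognition
    return f"asr({audio})"


def translate(text: str) -> str:     # text translation
    return f"mt({text})"


def tts(text: str) -> str:           # text-to-speech synthesis
    return f"tts({text})"


pipeline = cascade([asr, translate, tts])
print(pipeline("hola"))

# Illustrative (made-up) accuracies: three stages at 90% each give
# roughly 73% end to end, lower than any single stage.
print(end_to_end_accuracy([0.9, 0.9, 0.9]))
```

A direct model replaces the three chained stages with a single mapping from source speech to target speech, so an early mistake is not passed to later stages in this way.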

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Focus