Speech to Speech Translation with Translatotron: A State of the Art Review
Jules R. Kala, Emmanuel Adetiba, Abdultaofeek Abayom, Oluwatobi E. Dare, Ayodele H. Ifijeh

TL;DR
This paper reviews the evolution of Translatotron, a direct speech-to-speech translation model, highlighting its improvements over cascade models and its potential to bridge language gaps, especially for African languages.
Contribution
It provides a comprehensive review of all Translatotron versions and evaluates their effectiveness compared to traditional cascade models.
Findings
Translatotron 3 outperforms cascade models in some aspects.
Translatotron models reduce compound errors in speech translation.
Translatotron is effective for bridging African languages with others.
Abstract
Cascade-based speech-to-speech translation has long been considered the benchmark, but it suffers from several issues, notably translation latency and compound errors. These issues arise because a cascade chains several components: speech recognition, text-to-text translation, and finally text-to-speech synthesis, so an error made in one stage propagates to the next. Translatotron, a sequence-to-sequence direct speech-to-speech translation model, was designed by Google to address the compound errors associated with cascade models. Today there are three versions of the model: Translatotron 1, Translatotron 2, and Translatotron 3. The first version was designed as a proof of concept to show that direct speech-to-speech translation was possible; it was found to be less effective than the cascade model but produced promising results.…
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Focus
