Direct Speech-to-Speech Neural Machine Translation: A Survey
Mahendra Gupta, Maitreyee Dutta, Chandresh Kumar Maurya

TL;DR
This survey reviews direct speech-to-speech translation models, highlighting their advantages, current limitations, and future research challenges in achieving high-quality, real-world performance without relying on intermediate text representations.
Contribution
To the authors' knowledge, it provides the first comprehensive overview of direct S2ST models, analyzing their performance, data issues, and potential future directions for researchers.
Findings
Direct S2ST models can translate speech without intermediate text.
Current models lag behind cascade systems in translation quality.
Challenges include data scarcity and real-world application performance.
Abstract
Speech-to-Speech Translation (S2ST) models transform speech in one language into speech in a target language while preserving the linguistic information. S2ST is important for bridging the communication gap among communities and has diverse applications. In recent years, researchers have introduced direct S2ST models, which have the potential to translate speech without relying on intermediate text generation, to achieve better decoding latency, and to preserve paralinguistic and non-linguistic features. However, direct S2ST has yet to reach the quality needed for seamless communication and still lags behind cascade models, especially in real-world translation. To the best of our knowledge, no comprehensive survey of direct S2ST systems is available for beginners and advanced researchers to consult. The present work provides a…
Taxonomy
Topics: Natural Language Processing Techniques
