SimulTron: On-Device Simultaneous Speech to Speech Translation
Alex Agranovich, Eliya Nachmani, Oleg Rybakov, Yifan Ding, Ye Jia, Nadav Bar, Heiga Zen, Michelle Tadmor Ramanovich

TL;DR
SimulTron is a lightweight, on-device speech-to-speech translation model that improves real-time translation accuracy and latency, demonstrating successful deployment on mobile hardware and surpassing previous models in evaluations.
Contribution
We introduce SimulTron, a novel streaming S2ST architecture optimized for mobile devices, with key modifications to existing frameworks for improved performance and real-time operation.
Findings
SimulTron outperforms Translatotron 2 in offline evaluations.
SimulTron achieves better BLEU scores and latency than the previous real-time S2ST method on the MuST-C dataset.
SimulTron was successfully deployed on a Pixel 7 Pro, demonstrating on-device capability.
Abstract
Simultaneous speech-to-speech translation (S2ST) holds the promise of breaking down communication barriers and enabling fluid conversations across languages. However, achieving accurate, real-time translation on mobile devices remains a major challenge. We introduce SimulTron, a novel S2ST architecture designed to tackle this task. SimulTron is a lightweight direct S2ST model that builds on the strengths of the Translatotron framework while incorporating key modifications for streaming operation and an adjustable fixed delay. Our experiments show that SimulTron surpasses Translatotron 2 in offline evaluations. Furthermore, real-time evaluations reveal that SimulTron improves upon the performance achieved by Translatotron 1. Additionally, SimulTron achieves superior BLEU scores and latency compared to the previous real-time S2ST method on the MuST-C dataset. Significantly, we have…
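The "adjustable fixed delay" mentioned in the abstract can be illustrated with a generic sketch. This is not the authors' implementation; it only shows the general idea of a fixed-delay streaming policy, in which output step i is produced once input chunks 0 through i + delay have arrived. The function name `fixed_delay_stream`, the `step` callback, and the chunk representation are all hypothetical and chosen for illustration.

```python
from typing import Any, Callable, Iterable, Iterator, List


def fixed_delay_stream(
    chunks: Iterable[Any],
    delay: int,
    step: Callable[[List[Any], int], Any],
) -> Iterator[Any]:
    """Yield one output per input chunk, lagging the input by `delay` chunks.

    `step(context, i)` is a hypothetical stand-in for a streaming model:
    it receives the input seen so far and the index of the output to produce.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")

    seen: List[Any] = []
    emitted = 0
    for chunk in chunks:
        seen.append(chunk)
        # Output i becomes available once chunk i + delay has been received.
        while emitted + delay < len(seen):
            yield step(seen[: emitted + delay + 1], emitted)
            emitted += 1

    # End of input: flush the remaining outputs using all available context.
    while emitted < len(seen):
        yield step(seen, emitted)
        emitted += 1


if __name__ == "__main__":
    # Toy "model": report the output index and how much input it could see.
    toy_step = lambda context, i: (i, len(context))
    for out in fixed_delay_stream(["a", "b", "c", "d"], delay=2, step=toy_step):
        print(out)
```

In this sketch, a larger `delay` gives each output more lookahead context at the cost of higher latency, which is the trade-off an adjustable delay exposes.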
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
