SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation
Amirbek Djanibekov, Luisa Bentivogli, Matteo Negri, Sara Papi

TL;DR
SimulU is a training-free policy for long-form simultaneous speech-to-speech translation (SimulS2S). It reuses pre-trained end-to-end models for real-time multilingual communication, with no task-specific training.
Contribution
SimulU is presented as the first training-free policy for long-form SimulS2S. It combines history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models.
Findings
Achieves a better or comparable quality-latency trade-off against strong cascaded models on MuST-C across 8 languages.
Operates effectively on continuous, long-form speech.
Eliminates the need for resource-intensive training procedures.
Abstract
Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a…
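The abstract does not spell out the decision rule, so the following is only a minimal sketch of how cross-attention could regulate both output generation and input history. It assumes a generic attention-based rule: an output token is held back when its most-attended input frame is among the most recent frames, and older input is dropped once no retained token attends to it. The function names (`stable_prefix_length`, `history_start_frame`) and the parameters `num_recent_frames` and `keep_tokens` are hypothetical and are not taken from the paper.

```python
from typing import Sequence

Attention = Sequence[Sequence[float]]  # attention[i][j]: output token i -> input frame j


def _most_attended_frame(weights: Sequence[float]) -> int:
    """Index of the input frame with the highest attention weight."""
    return max(range(len(weights)), key=lambda j: weights[j])


def stable_prefix_length(attention: Attention, num_recent_frames: int) -> int:
    """Count the leading output tokens that can be emitted now.

    Assumption (not from the paper): a token is unstable when its
    most-attended frame lies within the last `num_recent_frames` frames,
    because that part of the input may still be incomplete. Emission
    stops at the first unstable token.
    """
    if not attention:
        return 0
    cutoff = len(attention[0]) - num_recent_frames  # frames >= cutoff are "recent"
    emitted = 0
    for weights in attention:
        if _most_attended_frame(weights) >= cutoff:
            break
        emitted += 1
    return emitted


def history_start_frame(attention: Attention, emitted: int, keep_tokens: int) -> int:
    """First input frame to keep after `emitted` tokens have been emitted.

    Assumption (not from the paper): only the last `keep_tokens` emitted
    tokens are retained as context, and input frames before the earliest
    frame they attend to most strongly can be discarded.
    """
    if emitted <= keep_tokens:
        return 0  # not enough output yet to justify trimming the input
    retained = range(emitted - keep_tokens, emitted)
    return min(_most_attended_frame(attention[i]) for i in retained)


if __name__ == "__main__":
    # Toy attention matrix: 4 output tokens over 4 input frames.
    attention = [
        [0.7, 0.2, 0.1, 0.0],
        [0.1, 0.6, 0.2, 0.1],
        [0.0, 0.1, 0.2, 0.7],
        [0.1, 0.7, 0.1, 0.1],
    ]
    emitted = stable_prefix_length(attention, num_recent_frames=1)
    print("tokens to emit:", emitted)
    print("keep input from frame:", history_start_frame(attention, emitted, keep_tokens=1))
```

In this toy case the third token attends mostly to the newest frame, so only the first two tokens are emitted and the input history is trimmed to start at the frame the last retained token attends to. Because both decisions read attention weights from an existing model, a rule of this kind needs no additional training, which is the property the abstract emphasizes.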
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems
