SimulU: Training-free Policy for Long-form Simultaneous Speech-to-Speech Translation

Amirbek Djanibekov; Luisa Bentivogli; Matteo Negri; Sara Papi

arXiv:2603.16924 · eess.AS · March 19, 2026

Open Access

TL;DR

SimulU is a training-free policy for long-form simultaneous speech-to-speech translation. It exploits cross-attention in pre-trained end-to-end models to manage input history and select speech output, so no task-specific training is required.

Contribution

SimulU is the first training-free policy for long-form SimulS2S. It combines history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models.

Findings

01 Achieves a better or comparable quality-latency trade-off against strong cascaded models on MuST-C across 8 languages.

02 Operates on continuous, long-form speech rather than short, pre-segmented utterances.

03 Eliminates the need for resource-intensive training procedures.

Abstract

Simultaneous speech-to-speech translation (SimulS2S) is essential for real-time multilingual communication, with increasing integration into meeting and streaming platforms. Despite this, SimulS2S remains underexplored in research, where current solutions often rely on resource-intensive training procedures and operate on short-form, pre-segmented utterances, failing to generalize to continuous speech. To bridge this gap, we propose SimulU, the first training-free policy for long-form SimulS2S. SimulU adopts history management and speech output selection strategies that exploit cross-attention in pre-trained end-to-end models to regulate both input history and output generation. Evaluations on MuST-C across 8 languages show that SimulU achieves a better or comparable quality-latency trade-off against strong cascaded models. By eliminating the need for ad-hoc training, SimulU offers a…
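
The abstract does not spell out SimulU's decision rules, so the following is only a generic sketch of how a cross-attention signal could regulate both output and history. It is not the paper's algorithm. The function names, the `guard_frames` parameter, and the trimming heuristic are all hypothetical, and the attention matrix is assumed to be already averaged over heads and layers.

```python
from typing import Sequence


def _peak(row: Sequence[float]) -> int:
    """Index of the most-attended input frame (first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def stable_prefix_len(attn: Sequence[Sequence[float]], guard_frames: int) -> int:
    """Count leading output units whose attention peak avoids the newest frames.

    attn[i][j] is the cross-attention weight of output unit i on input frame j.
    guard_frames (hypothetical) is how many trailing frames are treated as too
    recent to rely on, since more audio may still arrive.
    """
    if not attn or not attn[0]:
        return 0
    cutoff = len(attn[0]) - guard_frames
    emitted = 0
    for row in attn:
        if _peak(row) >= cutoff:
            break  # this unit leans on the newest audio; wait for more input
        emitted += 1
    return emitted


def history_start(attn: Sequence[Sequence[float]], emitted: int) -> int:
    """First input frame to keep (hypothetical heuristic).

    Frames before the attention peak of the last emitted unit are assumed to be
    fully translated and therefore safe to drop from the input history.
    """
    if emitted == 0:
        return 0
    return _peak(attn[emitted - 1])
```

With three candidate units over four frames and `guard_frames=1`, a unit whose peak falls on the last frame is withheld along with everything after it, and the history can be trimmed up to the peak of the last unit that was emitted.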

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems