RosettaSpeech: Zero-Shot Speech-to-Speech Translation without Parallel Speech

Zhisheng Zheng; Xiaohang Sun; Tuan Dinh; Abhishek Yanamandra; Abhinav Jain; Zhu Liu; Sunil Hadap; Vimal Bhat; Manoj Aggarwal; Gerard Medioni; David Harwath

arXiv:2511.20974 · eess.AS · February 17, 2026

TL;DR

RosettaSpeech is a zero-shot speech-to-speech translation framework trained on monolingual speech-text data with machine translation supervision. It reports state-of-the-art zero-shot results on the CVSS-C benchmark without using parallel speech data.

Contribution

The paper introduces a training approach that uses text as a semantic bridge to synthesize translation targets, enabling end-to-end speech translation without parallel speech corpora.
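As a rough illustration of the "text as a semantic bridge" idea, the sketch below pairs each monolingual speech-text item with a machine-translated version of its transcript, so the source audio gains a target-language text label without any parallel speech. This is a minimal sketch under stated assumptions, not the authors' pipeline: the class names, `build_bridged_examples`, and the toy `toy_translate` lookup are all hypothetical, and a real system would call an actual MT model and go on to produce speech targets.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MonolingualExample:
    """One monolingual speech-text item (hypothetical schema)."""
    audio_path: str
    transcript: str
    lang: str


@dataclass
class BridgedExample:
    """Source speech paired with a machine-translated text target."""
    audio_path: str
    source_text: str
    target_text: str
    source_lang: str
    target_lang: str


def build_bridged_examples(
    corpus: List[MonolingualExample],
    translate: Callable[[str, str, str], str],
    target_lang: str,
) -> List[BridgedExample]:
    """Attach an MT-produced target text to every non-target-language item."""
    bridged = []
    for ex in corpus:
        if ex.lang == target_lang:
            # Already in the target language: nothing to translate.
            continue
        bridged.append(
            BridgedExample(
                audio_path=ex.audio_path,
                source_text=ex.transcript,
                target_text=translate(ex.transcript, ex.lang, target_lang),
                source_lang=ex.lang,
                target_lang=target_lang,
            )
        )
    return bridged


def toy_translate(text: str, source_lang: str, target_lang: str) -> str:
    """Stand-in for a real MT system (assumption: a tiny lookup table)."""
    lookup = {("de", "en", "guten morgen"): "good morning"}
    return lookup.get((source_lang, target_lang, text.lower()), text)


corpus = [
    MonolingualExample("a.wav", "Guten Morgen", "de"),
    MonolingualExample("b.wav", "hello", "en"),
]
pairs = build_bridged_examples(corpus, toy_translate, target_lang="en")
```

Here `pairs` holds one item: the German audio `a.wav` with the English text target produced by the stand-in translator. The English item is skipped because it needs no bridge.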

Findings

01. Achieves state-of-the-art zero-shot translation performance on the CVSS-C benchmark.

02. Preserves the source speaker's voice without paired speech data.

03. Scales to many-to-one translation scenarios.

Abstract

End-to-end speech-to-speech translation (S2ST) systems typically struggle with a critical data bottleneck: the scarcity of parallel speech-to-speech corpora. To overcome this, we introduce RosettaSpeech, a novel zero-shot framework trained exclusively on monolingual speech-text data augmented by machine translation supervision. Unlike prior work that relies on complex cascaded pseudo-labeling, our approach uses text as a semantic bridge during training to synthesize translation targets, thereby eliminating the need for parallel speech pairs while maintaining a direct, end-to-end inference pipeline. Empirical evaluations on the CVSS-C benchmark demonstrate that RosettaSpeech achieves state-of-the-art zero-shot performance, surpassing leading baselines by significant margins, with ASR-BLEU scores of 25.17 for German-to-English (+27% relative gain) and 29.86 for…
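To make the "+27% relative gain" figure concrete, the following sketch shows the arithmetic behind a relative gain and its inverse. The helper names are hypothetical, and the implied baseline (roughly 19.8 ASR-BLEU for German-to-English) is derived here from the two reported numbers; it is not stated in the abstract text above and is only approximate because the percentage is rounded.

```python
def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement of `score` over `baseline`, as a fraction."""
    return (score - baseline) / baseline


def implied_baseline(score: float, gain: float) -> float:
    """Baseline that would yield `score` at the given relative gain."""
    return score / (1.0 + gain)


# Reported German-to-English ASR-BLEU and relative gain from the abstract.
de_en_score = 25.17
de_en_gain = 0.27

# Derived value, not a figure quoted from the paper.
de_en_baseline = implied_baseline(de_en_score, de_en_gain)
```

Plugging the derived baseline back into `relative_gain` recovers 0.27, which is a quick consistency check on the two helpers.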

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling