Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer
Yongqi Wang, Jionghao Bai, Rongjie Huang, Ruiqi Li, Zhiqing Hong, and Zhou Zhao

TL;DR
This paper presents a speech-to-speech translation system that uses discrete self-supervised representations and in-context learning for style transfer. It preserves the source speaker's timbre without speaker-parallel training data and achieves zero-shot cross-lingual style transfer.
Contribution
The approach combines discrete self-supervised speech representations with in-context learning for style transfer, which removes the need for speaker-parallel data and enables zero-shot cross-lingual style transfer.
Findings
Achieves high fidelity and speaker similarity in translated speech.
Enables zero-shot cross-lingual style transfer from previously unseen source languages.
Does not rely on speaker-parallel data for style transfer.
Abstract
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy, but is unable to preserve the speaker timbre of the source speech. Meanwhile, the scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation. We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and codec units. The acoustic language model we introduce for style transfer leverages self-supervised in-context learning, acquiring style transfer ability without relying on any speaker-parallel data, thereby overcoming data scarcity. By using extensive training data, our model achieves zero-shot cross-lingual style transfer on previously unseen source languages. Experiments show that our model generates translated speeches with high fidelity and…
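The abstract does not spell out how in-context learning avoids speaker-parallel data, so the following is only a minimal sketch of one common way to set it up, not the authors' implementation. It assumes that each training example comes from a single utterance: the first part of that utterance's codec (acoustic) units serves as the style prompt, and the model is trained to predict the remaining acoustic units given the semantic units of that remaining part. The function name `build_training_sequence`, the separator id `SEP`, and the prefix-split strategy are all hypothetical.

```python
from typing import List, Tuple

SEP = -1  # hypothetical separator token id, chosen only for this sketch


def build_training_sequence(
    target_semantic_units: List[int],
    acoustic_units: List[int],
    prompt_len: int,
) -> Tuple[List[int], List[int]]:
    """Build one self-supervised in-context example from a single utterance.

    Assumption: the first `prompt_len` acoustic units act as the style
    prompt, and the rest are the prediction target. Because prompt and
    target come from the same speaker and recording, no speaker-parallel
    pair is required.
    """
    if not 0 < prompt_len < len(acoustic_units):
        raise ValueError("prompt_len must leave a non-empty prompt and target")

    prompt = acoustic_units[:prompt_len]   # style context
    target = acoustic_units[prompt_len:]   # units the model must predict

    # Context = content (semantic units) + style (acoustic prompt).
    context = target_semantic_units + [SEP] + prompt + [SEP]
    tokens = context + target

    # Loss is computed only on the target acoustic units.
    loss_mask = [0] * len(context) + [1] * len(target)
    return tokens, loss_mask


tokens, loss_mask = build_training_sequence([5, 6, 7], [10, 11, 12, 13], prompt_len=2)
```

Under the same assumptions, inference would swap in semantic units produced by the translation stage and an acoustic prompt taken from the source-language speech, which is what would let the style carry across languages.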
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
