Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
Qingkai Fang, Shaolei Zhang, Zhengrui Ma, Min Zhang, Yang Feng

TL;DR
This paper introduces ComSpeech, a composite model that integrates pretrained speech-to-text translation (S2TT) and text-to-speech (TTS) models for direct speech-to-speech translation (S2ST), and proposes ComSpeech-ZS, a zero-shot training method that requires no parallel speech data and achieves near state-of-the-art results.
Contribution
The paper presents a new composite S2ST model and a zero-shot training method that uses only existing pretrained S2TT and TTS models and their data, removing the reliance on parallel speech data.
Findings
ComSpeech outperforms previous models in translation quality and speed when parallel data is available.
ComSpeech-ZS achieves near state-of-the-art results without parallel speech data.
The methods effectively utilize pretrained models and contrastive learning for zero-shot translation.
Abstract
Recently proposed two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) within an end-to-end model, yielding promising results. However, the training of these models still relies on parallel speech data, which is extremely challenging to collect. In contrast, S2TT and TTS have accumulated a large amount of data and pretrained models, which have not been fully utilized in the development of S2ST models. Inspired by this, in this paper, we first introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model. Furthermore, to eliminate the reliance on parallel speech data, we propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data. It aligns representations in the latent space through contrastive…
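The abstract is cut off before it describes the contrastive objective, so the exact loss used by ComSpeech-ZS is not shown here. As a rough illustration of what aligning two sets of representations with contrastive learning can look like, the following is a minimal sketch of a generic InfoNCE-style loss in plain Python. The function names (`cosine`, `info_nce`), the temperature value, and the toy vectors are all assumptions made for this example and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(speech_reprs, text_reprs, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Hypothetical setup: speech_reprs[i] and text_reprs[i] are a positive pair,
    and every other text_reprs[j] in the batch serves as a negative.
    """
    n = len(speech_reprs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(speech_reprs[i], text_reprs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching candidate.
        total += log_denom - logits[i]
    return total / n


# Toy batch: matching pairs line up in `aligned`, and are swapped in `shuffled`.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each representation is closest to its own partner and large when partners are mismatched, which is the property a contrastive alignment objective relies on.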
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems
