PolyVoice: Language Models for Speech to Speech Translation
Qianqian Dong, Zhiying Huang, Qiao Tian, Chen Xu, Tom Ko, Yunlong Zhao, Siyuan Feng, Tang Li, Kexin Wang, Xuxin Cheng, Fengpeng Yue, Ye Bai, Xi Chen, Lu Lu, Zejun Ma, Yuping Wang, Mingxuan Wang, Yuxuan Wang

TL;DR
PolyVoice introduces a novel speech-to-speech translation framework using language models and discretized speech units, enabling translation for unwritten languages while preserving voice characteristics and style.
Contribution
PolyVoice combines a translation language model and a speech synthesis language model, both operating on discretized speech units learned without supervision, so it can translate unwritten languages while preserving the speaker's voice.
Findings
High translation quality demonstrated on Chinese-English and English-Spanish pairs.
Effective preservation of voice characteristics and speaking style.
Framework capable of handling unwritten languages.
Abstract
We propose PolyVoice, a language model-based framework for speech-to-speech translation (S2ST). Our framework consists of two language models: a translation language model and a speech synthesis language model. We use discretized speech units, which are generated in a fully unsupervised way, and thus our framework can be used for unwritten languages. For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model. This grants our framework the ability to preserve the voice characteristics and the speaking style of the original speech. We examine our system on Chinese → English and English → Spanish pairs. Experimental results show that our system can generate speech with high translation quality and audio quality. Speech samples are available at https://speechtranslation.github.io/polyvoice.
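The data flow described in the abstract, together with the prompt layout one reviewer summarizes below (source semantic units, target semantic units, then source acoustic units), can be illustrated with a small sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the nearest-centroid quantizer stands in for whatever unsupervised unit extractor the paper uses, and the names `discretize`, `build_synthesis_prompt`, and the `SEP` token are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]

SEP = -1  # hypothetical separator token between prompt segments


def nearest_unit(frame: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook vector closest to `frame`
    (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, codebook[k])),
    )


def discretize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature frame to a discrete unit id.

    Assumption: the codebook comes from some unsupervised clustering of
    speech features; the abstract only says units are learned without
    supervision and does not name the method.
    """
    return [nearest_unit(f, codebook) for f in frames]


def build_synthesis_prompt(
    src_semantic: Sequence[int],
    tgt_semantic: Sequence[int],
    src_acoustic: Sequence[int],
) -> List[int]:
    """Concatenate the three unit streams into one decoder-only LM prompt.

    The ordering follows the reviewer's description; the separator token
    is an assumption made for this sketch.
    """
    return [*src_semantic, SEP, *tgt_semantic, SEP, *src_acoustic]


# Toy 2-D "features" and a 3-entry codebook, purely for illustration.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.9), (0.1, 0.0)]

src_units = discretize(frames, codebook)
prompt = build_synthesis_prompt(src_units, [2, 1], [7, 8, 9])
print(src_units)
print(prompt)
```

In a real system the translation language model would produce the target semantic units from the source units, and the speech synthesis language model would continue the prompt with target acoustic units. Both steps are replaced by hard-coded lists here.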
Peer Reviews
Decision: ICLR 2024 poster
- The proposed framework is novel in its approach to speech-to-speech translation in that it uses decoder-only models.
- The decoder-only framework simplifies the model architecture and hence makes the implementation of the translation system straightforward.
- The proposed method is based on unsupervised semantic and acoustic units, making it possible to build systems for unwritten languages.
- Performance on the datasets shown is quite competitive and the ablation studies further highlight th
- The duration and speech synthesis models depend on the translation model. Hence the training of two models depends on one upstream model, which can make experimentation slow. At the least, the duration model could be folded into the translation model, as shown in the paper Text-Free Prosody-Aware Generative Spoken Language Modeling (Kharitonov et al.).
- Since the authors use CVSS, it would be desirable to show the performance on other language pairs from the dataset to make the evaluatio
(1) The use of decoder-only LMs via different prompting strategies and discrete semantic and acoustic units for all components (translation LM, duration LM, and speech synthesis LM) could let S2ST benefit from competitive pre-trained text decoder-only LLMs. (2) Empirical evaluations show that PolyVoice is comparable to VALL-E X: very slightly better on ASV, worse on ASR-BLEU, and better on naturalness. Ablation studies show the contribution of the designed duration LM which uses an LM to predict du
(1) The innovations of this work need to be more clearly explained. This work bears strong similarity to VALL-E X. It is important to clarify the difference between the proposed approach and VALL-E X, but the paper did not clearly point out the difference between PolyVoice and VALL-E X to highlight the innovations of the proposed PolyVoice. Both works concatenate source and target semantic units and the source acoustic units to create the prompt for the LM. For PolyVoice, this prompt is created
The system design (using three decoder-only models) seems sound and worth investigating, although I'm not 100% on board with motivating it with the rise of GPT - there is more to the success of LLMs than being decoder-only models. Anyway, the main results (Table 2) look solid. The system description seems clear superficially, but there are some core open questions (see weaknesses).
I couldn't get a good sense of the training data - specifically how the different prompts from Table 1 are used to synthesize data, and the size of the synthesized dataset: is it the 44M sentences from Table 7 in the appendix, or more because multiple prompts are used? How does the training data compare to the baselines? My main concern would be that the ablation studies are not effective for disentangling the many design choices and the many moving parts of the whole architecture. The encoder-
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
