TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Chenyang Le; Yao Qian; Dongmei Wang; Long Zhou; Shujie Liu; Xiaofei Wang; Midia Yousefi; Yanmin Qian; Jinyu Li; Sheng Zhao; Michael Zeng

arXiv:2405.17809 · cs.CL · November 1, 2024 · 1 citation

TL;DR

TransVIP is a speech-to-speech translation system that draws on diverse datasets in a cascade fashion during training yet runs end-to-end at inference. It preserves the speaker's voice and timing (isochrony) and is reported to outperform current state-of-the-art models on French-English translation tasks.

Contribution

The paper introduces TransVIP, a novel framework that leverages diverse datasets in a cascade manner while enabling end-to-end inference and preserving speaker voice and isochrony.
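One way to read "cascade" training combined with end-to-end inference is as a chain-rule factorization of a joint distribution over the translated text and the translated speech. The sketch below is illustrative only and is not taken from the paper: the symbols X (source speech), T (target text), and Y (target speech) are assumed notation, and the paper's actual formulation may differ.

```latex
% Illustrative notation (assumed, not from the source):
%   X = source speech, T = target text, Y = target speech
\begin{align}
p(T, Y \mid X) &= p(T \mid X)\, p(Y \mid T, X)
\end{align}
```

Under this reading, each factor could in principle be trained on a different kind of data (for example, speech-to-text data for the first factor and speech synthesis data for the second), while inference still decodes T and Y jointly from X rather than chaining separate systems.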

Findings

1. Outperforms state-of-the-art speech-to-speech translation models on French-English tasks.

2. Effectively preserves speaker voice characteristics during translation.

3. Maintains source speech isochrony, suitable for video dubbing applications.

Abstract

There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeline framework by concatenating speech recognition, machine translation and text-to-speech models. The primary challenges stem from the inherent complexities involved in direct translation tasks and the scarcity of data. In this study, we introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion yet facilitates end-to-end inference through joint probability. Furthermore, we propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process, making it highly suitable for scenarios such as video dubbing. Our experiments on the…

Code & Models

Repositories

nethermanpro/transvip (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis