VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

Disong Wang; Shan Yang; Dan Su; Xunying Liu; Dong Yu; Helen Meng

arXiv:2202.09081 · eess.AS · February 21, 2022

TL;DR

This paper introduces a multi-speaker Video-to-Speech system that leverages cross-modal knowledge transfer from voice conversion, enabling accurate speech synthesis from silent videos with controllable speaker identity.

Contribution

It proposes a novel framework combining vector quantization, contrastive predictive coding, and cross-modal transfer for multi-speaker VTS, achieving state-of-the-art results.

Findings

01

High-quality speech synthesis with naturalness and intelligibility

02

Effective speaker control in multi-speaker scenarios

03

State-of-the-art performance in both constrained and open vocabulary settings

Abstract

Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker…
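The transfer step described in the abstract can be illustrated with a toy sketch. It assumes a nearest-neighbour codebook lookup as the quantization rule, which is the usual choice for vector quantization but is not confirmed by the text above. The names `quantize` and `lookup`, the 2-dimensional codebook, and the frame values are all hypothetical; the contrastive predictive coding loss, the content encoder, and the Lip2Ind network itself are omitted.

```python
# Toy sketch (not the authors' implementation): frames from a content
# encoder are mapped to codebook indices, and an index sequence alone is
# enough to recover the discrete acoustic units.

def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector
    under squared Euclidean distance (assumed quantization rule)."""
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code))
            for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def lookup(indices, codebook):
    """Turn an index sequence back into its acoustic-unit vectors."""
    return [codebook[i] for i in indices]


# Hypothetical codebook with three 2-dimensional "acoustic units".
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical content-encoder outputs for four audio frames.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.95, 0.9]]

# The indices derived from audio would serve as training targets for a
# lip-to-index model operating on the silent video.
target_indices = quantize(frames, codebook)

# At inference time, predicted indices stand in for the content encoder:
# looking them up yields the units passed on for speech reconstruction.
units = lookup(target_indices, codebook)
```

In this sketch the video side only ever needs to predict integers, which is what lets an index-predicting network replace the audio content encoder without changing the rest of the voice conversion pipeline.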

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: InfoNCE · Contrastive Predictive Coding