VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion
Disong Wang, Shan Yang, Dan Su, Xunying Liu, Dong Yu, Helen Meng

TL;DR
This paper introduces a multi-speaker Video-to-Speech system that leverages cross-modal knowledge transfer from voice conversion, enabling accurate speech synthesis from silent videos with controllable speaker identity.
Contribution
The paper proposes a framework for multi-speaker VTS that combines vector quantization with contrastive predictive coding (VQCPC) and cross-modal knowledge transfer from voice conversion, and reports state-of-the-art results in both constrained and open vocabulary settings.
Findings
High-quality speech synthesis with naturalness and intelligibility
Effective speaker control in multi-speaker scenarios
State-of-the-art performance in both constrained and open vocabulary settings
Abstract
Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech, while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system to convert silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker…
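To make the role of the discrete acoustic units more concrete, here is a minimal sketch of the generic vector quantization step the abstract refers to: each continuous content frame is replaced by the index of its nearest codebook vector, and that index sequence is what a Lip2Ind-style network would be trained to predict from video. The codebook values, frame values, and the function names `quantize` and `lookup` are illustrative assumptions, not taken from the paper; the paper's actual encoder, codebook size, and training losses are not shown here.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    This is plain nearest-neighbour vector quantization; it is an
    assumption that the paper's VQ layer behaves this way at inference.
    """
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def lookup(indices: Sequence[int],
           codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Recover the quantized acoustic units from an index sequence."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy codebook with three 2-dimensional acoustic units.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]

# Hypothetical content-encoder outputs for four audio frames.
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

# Index sequence derived from audio; in a cross-modal transfer setup this
# would serve as the training target for a network that sees only video.
indices = quantize(frames, codebook)

# At synthesis time, predicted indices are turned back into acoustic units.
units = lookup(indices, codebook)

print(indices)
print(units)
```

Because the targets are discrete indices rather than continuous features, the video-side network can in principle be trained as a per-frame classifier over the codebook, which is one plausible reading of why the abstract calls the model Lip-to-Index.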
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: InfoNCE · Contrastive Predictive Coding
