Speaker disentanglement in video-to-speech conversion
Dan Oneata, Adriana Stan, Horia Cucu

TL;DR
This paper introduces a multi-speaker video-to-speech model that disentangles speaker identity from speech content, enabling voice control and synthesis for unseen speakers while maintaining speech intelligibility.
Contribution
The paper proposes a video-to-speech architecture that uses adversarial training to separate speaker identity from linguistic content, extending the task from a single speaker to multiple and unseen speakers.
Findings
The visual encoder learns speaker identity from the lip region of the face alone.
Adversarial losses improve the disentanglement of speaker identity and linguistic content.
The method enables voice control and speech synthesis for unseen speakers.
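The summary does not say what form the adversarial losses take. As a rough illustration only, one common formulation trains an auxiliary speaker classifier on the content features and rewards the encoder when that classifier fails. The sketch below follows that generic recipe in plain Python. The function names, the `weight` value, and the subtraction-based objective are assumptions for illustration, not the authors' method.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the true speaker index."""
    return -math.log(softmax(logits)[target])


def encoder_objective(
    reconstruction_loss: float,
    speaker_logits: Sequence[float],
    speaker_id: int,
    weight: float = 0.1,  # hypothetical trade-off coefficient
) -> float:
    """Loss minimised by the encoder (illustrative, not from the paper).

    The speaker classifier itself would minimise `cross_entropy`; the encoder
    subtracts it, so content features that reveal the speaker cost more.
    """
    return reconstruction_loss - weight * cross_entropy(speaker_logits, speaker_id)
```

Under this sketch, content features from which the classifier confidently recovers the speaker give a higher encoder loss than features that leave the classifier at chance, which is the pressure that pushes speaker information out of the content pathway.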
Abstract
The task of video-to-speech aims to translate silent video of lip movement to its corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable as it allows one to i) leverage datasets with multiple speakers or few samples per speaker; and ii) control speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it to the multi-speaker scenario: we augment the network with an additional speaker-related input, through which we feed either a discrete identity or a speaker embedding. Interestingly, we observe that the visual encoder of the network is capable of learning the speaker identity from the lip region of the face alone. To better disentangle the two inputs -- linguistic content and speaker identity -- we…
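The abstract mentions an additional speaker-related input that accepts either a discrete identity or a speaker embedding. A minimal sketch of how such an input could be wired is shown below, assuming the speaker vector is simply concatenated to every frame of visual features. The concatenation, the lookup table, and all names are assumptions for illustration; the excerpt does not describe how the authors inject the speaker input.

```python
from typing import List, Sequence, Union

Vector = List[float]


def speaker_vector(
    speaker: Union[int, Sequence[float]],
    table: Sequence[Sequence[float]],
) -> Vector:
    """Resolve the speaker input to a vector.

    An integer is treated as a discrete identity and looked up in `table`
    (a hypothetical learned embedding table); anything else is treated as
    a precomputed speaker embedding and used as given.
    """
    if isinstance(speaker, int):
        return list(table[speaker])
    return list(speaker)


def condition_on_speaker(
    visual_features: Sequence[Sequence[float]],
    speaker: Union[int, Sequence[float]],
    table: Sequence[Sequence[float]],
) -> List[Vector]:
    """Append the same speaker vector to every frame of visual features."""
    vec = speaker_vector(speaker, table)
    return [list(frame) + vec for frame in visual_features]


# Usage: two frames of 3-dimensional visual features, two known speakers.
table = [[1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
known = condition_on_speaker(frames, 1, table)            # discrete identity
unseen = condition_on_speaker(frames, [0.5, 0.5], table)  # external embedding
```

Passing an embedding instead of an index is what would let a model of this shape accept speakers that were absent from training, since no table entry is needed for them.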
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis
