Speaker disentanglement in video-to-speech conversion

Dan Oneata; Adriana Stan; Horia Cucu

arXiv:2105.09652 · eess.AS · May 21, 2021 · 1 citation

TL;DR

This paper introduces a multi-speaker video-to-speech model that disentangles speaker identity from speech content, enabling voice control and synthesis for unseen speakers while maintaining speech intelligibility.

Contribution

It proposes a novel architecture with adversarial training to separate speaker identity from linguistic content in video-to-speech conversion, extending capabilities to multiple and unseen speakers.
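
As a rough illustration of what adversarial training for disentanglement can look like, the objective below shows one common formulation. It is an assumption made for this sketch, not the loss confirmed by this summary. All symbols are hypothetical: $E$ is the visual encoder, $D$ the speech decoder, $C$ an auxiliary speaker classifier, $v$ the silent video, $s$ the speaker input, $a$ the target audio, $y$ the speaker label, and $\lambda$ a trade-off weight.

```latex
\min_{E,\,D}\;\max_{C}\;\;
  \mathcal{L}_{\mathrm{rec}}\bigl(D(E(v),\, s),\; a\bigr)
  \;-\; \lambda\, \mathcal{L}_{\mathrm{CE}}\bigl(C(E(v)),\; y\bigr)
```

Under this formulation the classifier $C$ minimizes the cross-entropy, so it tries to recover the speaker from the visual features, while the encoder $E$ is pushed to increase that cross-entropy and so remove speaker cues. Speaker identity then has to reach the decoder through $s$.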

Findings

1. The visual encoder can learn speaker identity from the lip region of the face alone.

2. Adversarial losses improve the disentanglement of speaker identity and linguistic content.

3. The method enables voice control and speech synthesis for unseen speakers.

Abstract

The task of video-to-speech aims to translate silent video of lip movement into its corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable as it allows us to i) leverage datasets with multiple speakers or few samples per speaker; and ii) control speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it to the multi-speaker scenario: we augment the network with an additional speaker-related input, through which we feed either a discrete identity or a speaker embedding. Interestingly, we observe that the visual encoder of the network is capable of learning the speaker identity from the lip region of the face alone. To better disentangle the two inputs -- linguistic content and speaker identity -- we…
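
The abstract describes a speaker-related input that takes either a discrete identity or a speaker embedding. A minimal sketch of that idea follows, in plain Python with no deep-learning library. The names `SPEAKER_TABLE`, `speaker_vector`, and `condition` are hypothetical, and concatenating the speaker vector to each frame's visual features is an assumption; the abstract does not say how the two inputs are combined.

```python
from typing import List, Sequence, Union

Vector = List[float]

# Hypothetical lookup table: one vector per training speaker.
# In a real model these would be trainable parameters; here they are fixed.
SPEAKER_TABLE: List[Vector] = [
    [1.0, 0.0],  # speaker 0
    [0.0, 1.0],  # speaker 1
]


def speaker_vector(speaker: Union[int, Sequence[float]]) -> Vector:
    """Return a speaker vector from a discrete id or a precomputed embedding."""
    if isinstance(speaker, int):
        # Discrete identity: look up the vector of a known training speaker.
        return list(SPEAKER_TABLE[speaker])
    # Speaker embedding: use the given vector directly (e.g. for an unseen speaker).
    return [float(x) for x in speaker]


def condition(frames: Sequence[Sequence[float]],
              speaker: Union[int, Sequence[float]]) -> List[Vector]:
    """Append the same speaker vector to every per-frame visual feature."""
    s = speaker_vector(speaker)
    return [list(f) + s for f in frames]


# Two video frames, each with a 3-dimensional (made-up) visual feature.
visual = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

seen = condition(visual, 1)             # known speaker, selected by id
unseen = condition(visual, [0.5, 0.5])  # new speaker, selected by embedding
```

Swapping the id or the embedding while keeping `visual` fixed is what would allow the voice to be changed at inference time. For the decoder to accept both routes, an externally supplied embedding needs the same dimensionality as the table entries.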

Code & Models

Repositories

danoneata/xts
pytorch

Taxonomy

Topics: Speech and Audio Processing · Face recognition and analysis · Speech Recognition and Synthesis