FlexLip: A Controllable Text-to-Lip System

Dan Oneata; Beata Lorincz; Adriana Stan; Horia Cucu

arXiv:2206.03206·eess.AS·June 8, 2022

TL;DR

FlexLip is a modular, controllable text-to-lip system that efficiently generates lip landmarks from text with minimal data, enabling easy adaptation to new speakers and detailed evaluation of system components.

Contribution

The paper introduces a modular architecture for text-to-lip conversion, allowing component replacement, speaker adaptation, and comprehensive evaluation methods.

Findings

01. High-quality lip landmarks achieved with minimal training data

02. Zero-shot lip adaptation to unseen identities demonstrated

03. Objective measures show competitive performance with limited data

Abstract

The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed, some of them reaching close-to-natural performance in constrained tasks. In this paper, we tackle a sub-issue of the text-to-video generation problem by converting text into lip landmarks. We do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, named FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both built on controllable deep neural network architectures. This modularity enables easy replacement of each component, while also ensuring fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio…
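The two-module split described in the abstract can be pictured as two interfaces chained together. The sketch below is only an illustration of that idea, not the authors' implementation: every name in it (`TextToSpeech`, `SpeechToLip`, `TextToLipPipeline`, `DummyTTS`, `DummyS2L`) is hypothetical, the dummy components stand in for the deep networks the paper uses, and passing a `speaker_id` string is an assumed way of exposing speaker control.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Placeholder types, assumed for illustration only.
Audio = List[float]                    # mono waveform samples
Landmarks = List[Tuple[float, float]]  # 2D lip points for one video frame


class TextToSpeech(Protocol):
    """Hypothetical interface for the text-to-speech module."""
    def synthesize(self, text: str, speaker_id: str) -> Audio: ...


class SpeechToLip(Protocol):
    """Hypothetical interface for the speech-to-lip module."""
    def predict(self, audio: Audio, speaker_id: str) -> List[Landmarks]: ...


@dataclass
class TextToLipPipeline:
    """Chains the two modules; either one can be swapped independently."""
    tts: TextToSpeech
    s2l: SpeechToLip

    def run(self, text: str, speaker_id: str) -> List[Landmarks]:
        audio = self.tts.synthesize(text, speaker_id)
        return self.s2l.predict(audio, speaker_id)


class DummyTTS:
    """Stand-in TTS: emits 100 silent samples per input character."""
    def synthesize(self, text: str, speaker_id: str) -> Audio:
        return [0.0] * (100 * len(text))


class DummyS2L:
    """Stand-in speech-to-lip: one frame of 4 fixed points per 400 samples."""
    def predict(self, audio: Audio, speaker_id: str) -> List[Landmarks]:
        n_frames = len(audio) // 400
        return [[(0.0, 0.0)] * 4 for _ in range(n_frames)]


pipeline = TextToLipPipeline(tts=DummyTTS(), s2l=DummyS2L())
frames = pipeline.run("hello wo", speaker_id="spk0")
```

Because the pipeline depends only on the two interfaces, replacing `DummyTTS` with a different synthesizer leaves the speech-to-lip side untouched, which is the kind of component replacement and per-module evaluation the abstract describes.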
