FlexLip: A Controllable Text-to-Lip System
Dan Oneata, Beata Lorincz, Adriana Stan, Horia Cucu

TL;DR
FlexLip is a modular, controllable text-to-lip system that chains a text-to-speech module and a speech-to-lip module to generate lip landmarks from text using little training data. The modular design supports adaptation to new speakers and separate evaluation of each component.
Contribution
The paper introduces a modular architecture for text-to-lip conversion in which each component can be replaced, adapted to new speaker identities, and evaluated individually.
Findings
High-quality lip landmarks achieved with minimal training data
Zero-shot lip adaptation to unseen identities demonstrated
Objective measures show competitive performance with limited data
Abstract
The task of converting text input into video content is becoming an important topic for synthetic media generation. Several methods have been proposed, some of them reaching close-to-natural performance in constrained tasks. In this paper, we tackle a sub-issue of the text-to-video generation problem: converting text into lip landmarks. We do this using a modular, controllable system architecture and evaluate each of its individual components. Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip, both having underlying controllable deep neural network architectures. This modularity enables the easy replacement of each of its components, while also ensuring fast adaptation to new speaker identities by disentangling or projecting the input features. We show that by using as little as 20 min of data for the audio…
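The two-module split described in the abstract can be illustrated with a minimal sketch. Everything named here (`TextToSpeech`, `SpeechToLip`, `TextToLipPipeline`, `ToyTTS`, `ToyLip`) is hypothetical and chosen for illustration only. The toy modules stand in for the deep neural networks and are not the authors' implementation; the point is only that each stage sits behind a small interface and can be swapped independently.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

# Placeholder types (assumptions): a real system would use arrays of
# acoustic features and 2D landmark coordinates per video frame.
Audio = List[float]
Landmarks = List[List[Tuple[float, float]]]  # frames x points x (x, y)


class TextToSpeech(Protocol):
    def synthesize(self, text: str, speaker: str) -> Audio: ...


class SpeechToLip(Protocol):
    def predict(self, audio: Audio, speaker: str) -> Landmarks: ...


@dataclass
class TextToLipPipeline:
    """Chains the two modules; either one can be replaced on its own."""
    tts: TextToSpeech
    s2l: SpeechToLip

    def __call__(self, text: str, speaker: str) -> Landmarks:
        audio = self.tts.synthesize(text, speaker)
        return self.s2l.predict(audio, speaker)


class ToyTTS:
    """Stand-in for a text-to-speech model: one fake value per character."""
    def synthesize(self, text: str, speaker: str) -> Audio:
        return [float(ord(c) % 7) for c in text]


class ToyLip:
    """Stand-in for a speech-to-lip model: one frame per audio value."""
    def __init__(self, num_points: int = 4) -> None:
        self.num_points = num_points

    def predict(self, audio: Audio, speaker: str) -> Landmarks:
        return [[(float(i), a) for i in range(self.num_points)] for a in audio]


pipeline = TextToLipPipeline(tts=ToyTTS(), s2l=ToyLip(num_points=4))
frames = pipeline("hello", speaker="spk0")

# Swapping the speech-to-lip module leaves the text-to-speech module untouched.
coarse = TextToLipPipeline(tts=pipeline.tts, s2l=ToyLip(num_points=2))
coarse_frames = coarse("hello", speaker="spk0")
```

Because the pipeline depends only on the two interfaces, each module can also be evaluated in isolation, for example by feeding recorded speech directly to the speech-to-lip stage.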
