Facial Landmark Predictions with Applications to Metaverse
Qiao Han, Jun Zhao, Kwok-Yan Lam

TL;DR
This paper presents a novel method to generate lip movements for metaverse characters by extending the Tacotron 2 text-to-speech synthesizer to predict lip landmarks together with the mel spectrogram, enabling realistic lip animations learned from videos in the wild.
Contribution
It introduces a model that generates speech and lip movements jointly, combining transfer learning from a pre-trained text-to-speech encoder with a new decoder that predicts lip landmark displacements, trained on less than 5 minutes of video.
Findings
Effective lip landmark prediction from speech with minimal training data
Transfer learning improves lip movement synthesis accuracy
Model converges in 7 hours using less than 5 minutes of video
Abstract
This research aims to make metaverse characters more realistic by adding lip animations learnt from videos in the wild. To achieve this, we extend the Tacotron 2 text-to-speech synthesizer to generate lip movements together with the mel spectrogram in one pass. The encoder and gate layer weights are pre-trained on the LJ Speech 1.1 data set, while the decoder is retrained on 93 clips of TED talk videos extracted from the LRS 3 data set. Our novel decoder predicts the displacement of 20 lip landmark positions across time, using labels automatically extracted by the OpenFace 2.0 landmark predictor. Training converged in 7 hours using less than 5 minutes of video. We conducted an ablation study on the Pre/Post-Net and the pre-trained encoder weights to demonstrate the effectiveness of transfer learning between audio and visual speech data.
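To make the "displacement in 20 lip landmark positions across time" target concrete, here is a minimal sketch of how such labels could be derived from per-frame landmarks and turned back into positions. It assumes the common 68-point face annotation scheme, in which indices 48 to 67 are the 20 mouth points, and it assumes that "displacement" means the frame-to-frame difference of each landmark; the paper may define it differently (for example, relative to a reference frame). The function names are hypothetical and are not taken from the authors' code.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one video frame = list of (x, y) landmark positions

# Assumption: 68-point scheme, where indices 48..67 are the 20 mouth points.
LIP_START, LIP_END = 48, 68


def lip_points(face: Frame) -> Frame:
    """Keep only the 20 lip landmarks of a 68-point face."""
    return face[LIP_START:LIP_END]


def to_displacements(frames: List[Frame]) -> List[Frame]:
    """Frame-to-frame (dx, dy) per landmark; yields len(frames) - 1 entries."""
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def from_displacements(first: Frame, deltas: List[Frame]) -> List[Frame]:
    """Rebuild absolute positions by accumulating displacements from a start frame."""
    frames = [list(first)]
    for delta in deltas:
        prev = frames[-1]
        frames.append([(x + dx, y + dy) for (x, y), (dx, dy) in zip(prev, delta)])
    return frames
```

Under these assumptions, a decoder trained on the output of `to_displacements` would emit one 20-point displacement frame per time step, and `from_displacements` would recover the positions needed to drive an animation from a neutral starting pose.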
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing
Methods: Griffin-Lim Algorithm · Dilated Causal Convolution · Sigmoid Activation · Long Short-Term Memory · Dense Connections · Max Pooling · Highway Layer · Linear Layer · Bidirectional GRU
