A Novel Speech-Driven Lip-Sync Model with CNN and LSTM
Xiaohong Li, Xiang Wang, Kai Wang, Shiguo Lian

TL;DR
This paper introduces a deep neural network combining CNN and LSTM to generate realistic, synchronized lip movements from speech, enhancing virtual character animation with robustness and naturalness.
Contribution
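The model combines one-dimensional convolutions with an LSTM for speech-driven lip-sync. It uses features extracted by a trained speech recognition model to improve robustness to different sound signals, and a velocity loss term to reduce jitter in the generated animation.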
Findings
Generated lip movements are synchronized with speech.
The model produces smooth and natural lip animations.
The model is effective on a Mandarin speech dataset.
Abstract
Generating synchronized and natural lip movement with speech is one of the most important tasks in creating realistic virtual characters. In this paper, we present a combined deep neural network of one-dimensional convolutions and LSTM to generate vertex displacement of a 3D template face model from variable-length speech input. The motion of the lower part of the face, which is represented by the vertex movement of 3D lip shapes, is consistent with the input speech. In order to enhance the robustness of the network to different sound signals, we adapt a trained speech recognition model to extract speech features, and a velocity loss term is adopted to reduce the jitter of generated facial animation. We recorded a series of videos of a Chinese adult speaking Mandarin and created a new speech-animation dataset to compensate for the lack of such public data. Qualitative and quantitative…
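The abstract mentions a velocity loss term that reduces jitter but this summary does not give its formula. The sketch below shows one common formulation as an assumption, not the authors' definition: a mean squared error on vertex displacements plus a mean squared error on their frame-to-frame differences. The names `mse`, `frame_deltas`, `lip_sync_loss`, and `velocity_weight` are hypothetical and chosen only for illustration.

```python
# Assumed formulation: position MSE plus MSE on frame-to-frame differences.
# The paper's exact loss is not stated in this summary.

def mse(a, b):
    """Mean squared error between two equally shaped frame sequences."""
    total, count = 0.0, 0
    for frame_a, frame_b in zip(a, b):
        for x, y in zip(frame_a, frame_b):
            total += (x - y) ** 2
            count += 1
    return total / count


def frame_deltas(seq):
    """Per-coordinate differences between consecutive frames."""
    if len(seq) < 2:
        raise ValueError("need at least two frames to compute a velocity")
    return [[n - c for c, n in zip(cur, nxt)] for cur, nxt in zip(seq, seq[1:])]


def lip_sync_loss(pred, target, velocity_weight=1.0):
    """Return (total, position, velocity) loss terms for one sequence."""
    position = mse(pred, target)
    velocity = mse(frame_deltas(pred), frame_deltas(target))
    return position + velocity_weight * velocity, position, velocity


# Toy data: four frames, one displacement coordinate per frame.
target = [[0.0], [1.0], [2.0], [3.0]]
smooth = [[0.1], [1.1], [2.1], [3.1]]   # constant offset from the target
jitter = [[0.1], [0.9], [2.1], [2.9]]   # same per-frame error size, alternating sign

total_s, pos_s, vel_s = lip_sync_loss(smooth, target)
total_j, pos_j, vel_j = lip_sync_loss(jitter, target)
```

In this toy case both predictions have essentially the same position error, yet only the jittery one incurs a velocity penalty, which is why a term of this kind pushes a model toward temporally smooth output.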
Taxonomy
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory
