LinguaLinker: Audio-Driven Portraits Animation with Implicit Facial Control Enhancement
Rui Zhang, Yixiao Fang, Zhengnan Lu, Pei Cheng, Zebiao Huang, Bin Fu

TL;DR
LinguaLinker is a diffusion-based framework that synchronizes facial animations with multilingual audio inputs, improving lip-sync accuracy and portrait fidelity for diverse languages.
Contribution
It introduces a holistic diffusion-based approach that implicitly controls facial movements from audio features, surpassing traditional parametric models in animation quality.
Findings
Enhanced lip-sync accuracy and portrait fidelity.
Versatile application across different languages.
Improved control over facial motion nuances.
Abstract
This study delves into the intricacies of synchronizing facial dynamics with multilingual audio inputs, focusing on the creation of visually compelling, time-synchronized animations through diffusion-based techniques. Diverging from traditional parametric models for facial animation, our approach, termed LinguaLinker, adopts a holistic diffusion-based framework that integrates audio-driven visual synthesis to enhance the synergy between auditory stimuli and visual responses. We process audio features separately and derive the corresponding control gates, which implicitly govern the movements of the mouth, eyes, and head, irrespective of the portrait's origin. The advanced audio-driven visual synthesis mechanism provides nuanced control while keeping the output video consistent with the input audio, allowing for a more tailored and effective portrayal of distinct personas across different…
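The abstract mentions control gates derived from audio features that govern the mouth, eyes, and head. The sketch below illustrates only the general idea of per-region gating. It is not the authors' implementation: the abstract gives no architectural details, so the linear projection, the sigmoid gate, and every name here (`make_gate_weights`, `control_gates`, `apply_gates`) are assumptions made for illustration.

```python
import math
import random

# Regions named in the abstract; everything else below is hypothetical.
REGIONS = ("mouth", "eyes", "head")


def sigmoid(x):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def make_gate_weights(audio_dim, seed=0):
    """Random stand-in for learned projection weights, one vector per region."""
    rng = random.Random(seed)
    return {r: [rng.uniform(-1.0, 1.0) for _ in range(audio_dim)] for r in REGIONS}


def control_gates(audio_feat, weights):
    """Map one frame's audio feature vector to a scalar gate per region."""
    return {
        r: sigmoid(sum(w * a for w, a in zip(ws, audio_feat)))
        for r, ws in weights.items()
    }


def apply_gates(region_motion, gates):
    """Scale each region's motion features by that region's gate."""
    return {r: [gates[r] * v for v in region_motion[r]] for r in REGIONS}
```

In this toy version, an all-zero audio feature vector yields a gate of 0.5 for every region, and stronger audio evidence pushes a gate toward 0 or 1. A real system would learn the weights jointly with the diffusion model rather than sample them at random.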
Taxonomy
Topics: Human Motion and Animation · Subtitles and Audiovisual Media · Face recognition and analysis
