AudioFace: Language-Assisted Speech-Driven Facial Animation with Multimodal Language Models
Kai Zheng, Zejian Kang, Rui Mao, Hongyuan Zou, Yuanchen Fei, Xuanyang Xu, Xiangru Huang

TL;DR
AudioFace introduces a novel language-assisted approach for speech-driven facial animation, leveraging multimodal language models and linguistic cues to improve mouth movement accuracy.
Contribution
AudioFace is presented as the first framework to incorporate linguistic and phonetic information from multimodal language models into speech-driven facial animation.
Findings
AudioFace outperforms existing methods on multiple metrics.
Transcript- and phoneme-level linguistic cues improve mouth movement accuracy.
Multimodal priors effectively guide facial coefficient prediction.
Abstract
Speech-driven facial animation requires accurate correspondence between acoustic signals and facial motion, especially for articulation-related mouth movements. However, directly mapping speech audio to facial coefficients often overlooks the linguistic and phonetic structure underlying speech production. In this paper, we propose AudioFace, a language-assisted framework for speech-driven blendshape generation that treats mouth-related facial coefficient prediction as a structured generation problem guided by linguistic and articulatory information. Instead of relying solely on acoustic features, our method leverages the prior knowledge of multimodal large language models and introduces transcript- and phoneme-level cues to bridge speech signals with interpretable facial actions. Extensive experiments show that AudioFace achieves superior performance across multiple evaluation metrics,…
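To make the idea of phoneme-level cues guiding mouth-related coefficients more concrete, here is a minimal sketch. It is not the AudioFace method: the phoneme table, the ARKit-style blendshape names, the `PhonemeSpan` type, and the functions `prior_frames` and `blend` are all hypothetical, chosen only to illustrate how time-aligned phonemes could supply a per-frame articulatory prior that is then combined with an acoustic estimate.

```python
from dataclasses import dataclass

# Hypothetical phoneme -> mouth blendshape prior table.
# Names follow ARKit-style conventions; values are illustrative, not from the paper.
PHONEME_PRIORS = {
    "AA": {"jawOpen": 0.7, "mouthFunnel": 0.1},
    "M": {"mouthClose": 0.9, "jawOpen": 0.0},
    "UW": {"mouthPucker": 0.8, "jawOpen": 0.2},
    "F": {"mouthRollLower": 0.6, "jawOpen": 0.1},
}


@dataclass
class PhonemeSpan:
    """A phoneme with its aligned time interval [start, end) in seconds."""
    phoneme: str
    start: float
    end: float


def phoneme_at(spans, t):
    """Return the phoneme active at time t, or None during silence."""
    for span in spans:
        if span.start <= t < span.end:
            return span.phoneme
    return None


def prior_frames(spans, fps, duration):
    """Build one prior coefficient dict per animation frame."""
    n_frames = int(round(duration * fps))
    frames = []
    for i in range(n_frames):
        ph = phoneme_at(spans, i / fps)
        # Unknown phonemes and silence fall back to an empty (neutral) prior.
        frames.append(dict(PHONEME_PRIORS.get(ph, {})))
    return frames


def blend(prior, acoustic, weight=0.5):
    """Linearly mix a phoneme prior with an acoustic estimate, per coefficient."""
    keys = set(prior) | set(acoustic)
    return {
        k: weight * prior.get(k, 0.0) + (1.0 - weight) * acoustic.get(k, 0.0)
        for k in keys
    }


if __name__ == "__main__":
    spans = [PhonemeSpan("M", 0.0, 0.1), PhonemeSpan("AA", 0.1, 0.3)]
    for frame in prior_frames(spans, fps=10, duration=0.3):
        print(frame)
```

A fixed lookup table with linear blending is the simplest possible stand-in for "linguistic guidance". The abstract describes a learned approach built on multimodal large language model priors, which this sketch does not attempt to reproduce.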
