VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis
Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, and Richard Bowden

TL;DR
VisualSpeaker introduces a novel 3D avatar lip synthesis method that leverages photorealistic rendering and visual speech recognition supervision to enhance animation quality and accuracy.
Contribution
It presents a new perceptual lip-reading loss using differentiable rendering and pre-trained speech recognition, bridging 2D visual advances with 3D facial animation.
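To make the idea of a perceptual lip-reading loss concrete, here is a minimal sketch in plain Python. It assumes the loss compares features of rendered frames with features of reference frames; the summary above does not say whether the actual loss is computed on features, logits, or transcripts. The names `frozen_encoder` and `lip_reading_loss` are hypothetical, and the encoder is a toy stand-in for a pre-trained visual speech recognition model. A real training setup would use a differentiable renderer and an autograd framework so that gradients reach the 3D animation parameters.

```python
def frozen_encoder(frame):
    """Hypothetical stand-in for a frozen, pre-trained lip-reading encoder.

    `frame` is a grayscale mouth crop given as a list of pixel rows.
    The toy "feature" is simply the mean intensity of each row.
    """
    return [sum(row) / len(row) for row in frame]


def lip_reading_loss(rendered_frames, reference_frames, encoder=frozen_encoder):
    """Mean squared distance between encoder features of two frame sequences.

    Assumption: the perceptual loss is a feature-space distance. The encoder
    is treated as fixed; only the rendered frames would change in training.
    """
    if len(rendered_frames) != len(reference_frames) or not rendered_frames:
        raise ValueError("sequences must be non-empty and of equal length")

    total, count = 0.0, 0
    for rendered, reference in zip(rendered_frames, reference_frames):
        feat_rendered = encoder(rendered)
        feat_reference = encoder(reference)
        total += sum((a - b) ** 2 for a, b in zip(feat_rendered, feat_reference))
        count += len(feat_rendered)
    return total / count


# Usage: one 2x2 frame per sequence.
rendered = [[[0.0, 0.0], [1.0, 1.0]]]
reference = [[[1.0, 1.0], [1.0, 1.0]]]
print(lip_reading_loss(rendered, reference))
```

Because the encoder stays frozen, it works as a fixed perceptual yardstick: the loss can only be reduced by changing the rendered mouth motion, not by changing the recognizer.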
Findings
Reduces Lip Vertex Error by 56.1% on the MEAD dataset
Enhances perceptual quality of animations
Maintains controllability of mesh-driven animation
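For readers unfamiliar with the metric named in the findings, the sketch below shows one common way Lip Vertex Error is computed: take the largest error among lip vertices in each frame, then average over frames. This is an illustration under stated assumptions, not the paper's evaluation code. The function name, the `lip_indices` argument, and the `squared` flag are hypothetical, and implementations differ on whether the per-vertex error is the Euclidean distance or its square.

```python
import math


def lip_vertex_error(predicted, target, lip_indices, squared=False):
    """Average over frames of the maximum per-vertex error in the lip region.

    predicted, target: lists of frames; each frame is a list of (x, y, z) vertices.
    lip_indices: indices of the vertices that belong to the lip region (assumed given).
    squared: if True, use squared Euclidean distance as the per-vertex error.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")

    per_frame_max = []
    for pred_frame, target_frame in zip(predicted, target):
        errors = []
        for i in lip_indices:
            d = math.dist(pred_frame[i], target_frame[i])
            errors.append(d * d if squared else d)
        per_frame_max.append(max(errors))
    return sum(per_frame_max) / len(per_frame_max)


# Usage: two frames, three vertices each; only vertices 0 and 1 count as lips.
predicted = [[(0, 0, 0), (1, 0, 0), (0, 0, 0)],
             [(0, 0, 0), (0, 0, 0), (0, 0, 0)]]
target = [[(0, 0, 0), (1, 3, 4), (9, 9, 9)],
          [(0, 2, 0), (0, 0, 0), (9, 9, 9)]]
print(lip_vertex_error(predicted, target, lip_indices=[0, 1]))  # 3.5
```

Taking the maximum per frame makes the metric sensitive to the single worst lip vertex, so a lower value means the mouth shape stays close to the ground truth throughout the sequence.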
Abstract
Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while…
Taxonomy
Topics: Face recognition and analysis · Speech and Audio Processing · Facial Nerve Paralysis Treatment and Research
