VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis

Alexandre Symeonidis-Herzig, Özge Mercanoğlu Sincan, and Richard Bowden

arXiv:2507.06060 · cs.CV · July 22, 2025

Open Access

TL;DR

VisualSpeaker is a 3D avatar lip synthesis method that uses photorealistic differentiable rendering, supervised by visual speech recognition, to improve the quality and accuracy of 3D facial animation.

Contribution

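It presents a perceptual lip-reading loss, computed by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. This lets 3D facial animation benefit from advances in 2D computer vision.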
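
As a rough illustration of the idea, the sketch below computes a perceptual loss as a feature-space distance: rendered frames and reference frames are both passed through a frozen encoder, and the loss is the mean squared difference of the resulting features. This is a toy under stated assumptions, not the paper's implementation. `toy_render` and `toy_frozen_features` are hypothetical stand-ins for the differentiable avatar renderer and the pre-trained visual speech recognition model, and the exact form of the loss used by VisualSpeaker is not given in this summary.

```python
from typing import Callable, List, Sequence

Frame = List[float]     # flattened image; toy stand-in for a rendered frame
Features = List[float]  # encoder output for one frame


def toy_render(mouth_opening: float, n_pixels: int = 4) -> Frame:
    """Hypothetical stand-in for a differentiable avatar renderer."""
    # Pixel intensities grow linearly with the mouth-opening parameter.
    return [mouth_opening * (i + 1) / n_pixels for i in range(n_pixels)]


def toy_frozen_features(frame: Frame) -> Features:
    """Hypothetical stand-in for a frozen, pre-trained visual speech encoder."""
    mean = sum(frame) / len(frame)
    spread = max(frame) - min(frame)
    return [mean, spread]


def perceptual_loss(
    rendered: Sequence[Frame],
    reference: Sequence[Frame],
    encoder: Callable[[Frame], Features],
) -> float:
    """Mean squared feature distance between rendered and reference frames.

    Assumption: a feature-distance loss is used here for illustration only.
    """
    if len(rendered) == 0 or len(rendered) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    count = 0
    for r, g in zip(rendered, reference):
        fr, fg = encoder(r), encoder(g)
        total += sum((a - b) ** 2 for a, b in zip(fr, fg))
        count += len(fr)
    return total / count


# Usage: the second predicted frame opens the mouth less than the reference.
predicted = [toy_render(p) for p in (0.2, 0.5)]
target = [toy_render(p) for p in (0.2, 0.8)]
loss = perceptual_loss(predicted, target, toy_frozen_features)
```

In a real training loop the renderer and encoder would both be differentiable, so the gradient of this loss could flow back to the animation parameters while the encoder weights stay fixed.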

Findings

01. Improves the standard Lip Vertex Error metric by 56.1% on the MEAD dataset

02. Enhances the perceptual quality of the generated animations

03. Maintains the controllability of mesh-driven animation
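
For readers unfamiliar with the metric, the sketch below shows how Lip Vertex Error is commonly computed in the speech-driven facial animation literature: for each frame, take the largest L2 distance between predicted and ground-truth lip vertices, then average over frames. This page does not define the metric, so treat the definition as an assumption. The helper `relative_improvement` shows how a percentage figure of this kind is typically derived from a baseline error and a new error; the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]
LipFrame = Sequence[Vertex]  # lip-region vertices for one frame


def lip_vertex_error(pred: Sequence[LipFrame], target: Sequence[LipFrame]) -> float:
    """Mean over frames of the maximum per-vertex L2 error (assumed definition)."""
    if len(pred) == 0 or len(pred) != len(target):
        raise ValueError("need two non-empty sequences of equal length")
    per_frame_max = []
    for p_frame, t_frame in zip(pred, target):
        if len(p_frame) == 0 or len(p_frame) != len(t_frame):
            raise ValueError("frames must have matching, non-empty vertex lists")
        # Largest Euclidean distance among the lip vertices of this frame.
        per_frame_max.append(max(math.dist(p, t) for p, t in zip(p_frame, t_frame)))
    return sum(per_frame_max) / len(per_frame_max)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Usage with made-up values: two frames, two lip vertices each.
pred = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]]
target = [[(3.0, 4.0, 0.0), (0.0, 0.0, 1.0)], [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]]
lve = lip_vertex_error(pred, target)
gain = relative_improvement(2.0, 1.0)
```

Because the metric is an error, a lower value is better, and an "improvement" of a given percentage corresponds to a reduction of that size relative to the baseline.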

Abstract

Realistic, high-fidelity 3D facial animations are crucial for expressive avatar systems in human-computer interaction and accessibility. Although prior methods show promising quality, their reliance on the mesh domain limits their ability to fully leverage the rapid visual innovations seen in 2D computer vision and graphics. We propose VisualSpeaker, a novel method that bridges this gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation. Our contribution is a perceptual lip-reading loss, derived by passing photorealistic 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training. Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while…

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Facial Nerve Paralysis Treatment and Research
