Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation
Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari, Stefano Berretti, David Ferman, Hyeongwoo Kim, Pablo Garrido, and Akin Caliskan

TL;DR
Polyglot is a diffusion-based model that enables realistic, multilingual, and personalized speech-driven facial animation without needing predefined language or speaker labels.
Contribution
It introduces a unified architecture that jointly models language and individual speaking style, improving multilingual speech-driven facial animation (SDFA) through self-supervised learning.
Findings
Enhanced performance in monolingual and multilingual settings
Captures expressive traits like rhythm and facial movements
Produces temporally coherent and realistic animations
Abstract
Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, which limits their ability to model the interaction between language and speaker style. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference…
