Polyglot: Multilingual Style Preserving Speech-Driven Facial Animation

Federico Nocentini, Kwanggyoon Seo, Qingju Liu, Claudio Ferrari, Stefano Berretti, David Ferman, Hyeongwoo Kim, Pablo Garrido, and Akin Caliskan

arXiv:2604.16108 · cs.CV · April 20, 2026

TL;DR

Polyglot is a diffusion-based model that enables realistic, multilingual, and personalized speech-driven facial animation without needing predefined language or speaker labels.

Contribution

Polyglot introduces a unified architecture that jointly models language and individual speaking style, improving multilingual speech-driven facial animation (SDFA) through self-supervised learning.

Findings

01. Enhanced performance in monolingual and multilingual settings

02. Captures expressive traits like rhythm and facial movements

03. Produces temporally coherent and realistic animations

Abstract

Speech-Driven Facial Animation (SDFA) has gained significant attention due to its applications in movies, video games, and virtual reality. However, most existing models are trained on single-language data, limiting their effectiveness in real-world multilingual scenarios. In this work, we address multilingual SDFA, which is essential for realistic generation since language influences phonetics, rhythm, intonation, and facial expressions. Speaking style is also shaped by individual differences, not only by language. Existing methods typically rely on either language-specific or speaker-specific conditioning, but not both, limiting their ability to model their interaction. We introduce Polyglot, a unified diffusion-based architecture for personalized multilingual SDFA. Our method uses transcript embeddings to encode language information and style embeddings extracted from reference…
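As a rough illustration of the kind of conditioning the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes that audio features, a transcript embedding, and a style embedding are simply concatenated into one conditioning vector, and it pairs that with the standard DDPM forward-noising step under a linear beta schedule. The function names (`build_condition`, `alpha_bar`, `add_noise`), the concatenation-based fusion, and the schedule values are all hypothetical choices made for illustration; the paper's actual architecture and training setup may differ.

```python
import math
import random


def build_condition(audio_feat, transcript_emb, style_emb):
    """Fuse the three conditioning signals by concatenation (an assumed fusion)."""
    return list(audio_feat) + list(transcript_emb) + list(style_emb)


def alpha_bar(t, num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) over the first t steps of a linear schedule."""
    prod = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
    return prod


def add_noise(x0, t, num_steps, rng):
    """Standard DDPM forward step: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    ab = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy vectors standing in for one frame of facial motion and its conditioning.
    motion_frame = [0.1, -0.2, 0.3]
    cond = build_condition([0.5, 0.5], [1.0, 0.0, 0.0], [0.2])
    noisy, eps = add_noise(motion_frame, t=50, num_steps=100, rng=rng)
    # A denoiser (not shown) would take (noisy, t, cond) and be trained to predict eps.
    print(len(cond), len(noisy))
```

In a setup like this, a denoising network would receive the noisy motion, the timestep, and the conditioning vector, and would be trained to recover the added noise. Concatenation is only the simplest possible fusion; cross-attention or other mechanisms are equally plausible.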
