VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization
Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu

TL;DR
VQTalker is a vector quantization framework for multilingual talking head generation. It discretizes facial motion into tokens, building on the sound units (phonemes) and visual articulations (visemes) that languages share, to produce realistic, lip-synchronized facial animation across languages even with limited training data.
Contribution
It proposes a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ) that captures facial movements and generalizes them across languages, advancing multilingual talking face synthesis.
Findings
Achieves state-of-the-art results in multilingual scenarios
Generates high-quality 512×512 video at a low bitrate
Demonstrates effective cross-lingual facial motion transfer
Abstract
We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive…
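To make the idea of Group Residual Finite Scalar Quantization more concrete, the following is a minimal NumPy sketch of how grouping, residual stages, and finite scalar quantization can fit together. It is not the authors' implementation. It assumes odd level counts, inputs already bounded to [-1, 1] (a plain clip stands in for a learned encoder), and no gradient handling; all function names, parameters, and the example sizes are hypothetical.

```python
import numpy as np

def fsq(z, levels):
    """Snap each channel of z (assumed in [-1, 1]) to one of `levels` evenly spaced values."""
    half = (levels - 1) / 2.0                       # odd level counts assumed
    codes = np.round(np.clip(z, -1.0, 1.0) * half)  # integers in [-half, half]
    return codes / half, (codes + half).astype(np.int64)

def codes_to_index(codes, levels):
    """Pack per-channel codes into a single token id (mixed-radix encoding)."""
    radix = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (codes * radix).sum(axis=-1)

def residual_fsq(z, levels, num_stages):
    """Quantize z, then repeatedly quantize what is left over at a finer scale."""
    recon = np.zeros_like(z)
    scale = np.ones_like(z)
    tokens = []
    for _ in range(num_stages):
        q, codes = fsq((z - recon) / scale, levels)
        recon = recon + q * scale
        tokens.append(codes_to_index(codes, levels))
        scale = scale / (levels - 1)                # residual shrinks by this factor
    return recon, np.stack(tokens, axis=-1)

def grfsq(z, levels, num_groups, num_stages):
    """Split the feature vector into groups and run residual FSQ on each group."""
    recons, tokens = [], []
    for group in np.split(z, num_groups, axis=-1):
        r, t = residual_fsq(group, levels, num_stages)
        recons.append(r)
        tokens.append(t)
    # tokens has shape (..., num_groups, num_stages)
    return np.concatenate(recons, axis=-1), np.stack(tokens, axis=-2)

def bits_per_frame(levels, num_groups, num_stages):
    """Bits needed to store one frame's tokens under this scheme."""
    return num_groups * num_stages * float(np.log2(np.prod(levels)))

if __name__ == "__main__":
    levels = np.array([5, 5, 5])                    # hypothetical: 3 channels per group
    frames = np.random.default_rng(0).uniform(-1, 1, size=(4, 6))
    recon, tokens = grfsq(frames, levels, num_groups=2, num_stages=2)
    print(tokens.shape, np.abs(frames - recon).max())
    print(bits_per_frame(levels, num_groups=2, num_stages=2))
```

In this sketch each extra residual stage reduces the worst-case reconstruction error by a factor of `levels - 1`, which gives one way to read the coarse-to-fine refinement: early tokens carry the coarse motion and later tokens add detail. The `bits_per_frame` helper shows how a token budget translates into a bitrate once multiplied by the frame rate.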
Taxonomy
Topics: Face recognition and analysis · Human Motion and Animation · Human Pose and Action Recognition
Methods: Sparse Evolutionary Training
