Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation
Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage

TL;DR
This paper presents a method for creating small, high-quality, real-time facial animation models suitable for on-device use in games, using hybrid knowledge distillation to overcome dataset limitations.
Contribution
The authors introduce a hybrid knowledge distillation approach with pseudo-labeling to develop tiny, efficient facial animation models that maintain quality while enabling real-time on-device inference.
Findings
Memory footprint reduced to 3.4 MB
Audio context requirement reduced to as little as 81 ms
Maintained high-quality facial animations
Abstract
The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast…
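The pseudo-labeling step described in the abstract (a teacher model labels a large audio dataset, and a very small student is trained on those labels) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `teacher` function is a hypothetical stand-in for a large pre-trained model, the student is a tiny linear map rather than a neural network, and the 3-value "audio frames" are random numbers. It does not reproduce the paper's architecture, losses, or the hybrid part of its distillation scheme.

```python
import random

def teacher(frame):
    # Hypothetical stand-in for a large, high-performing teacher model:
    # maps one audio feature frame (3 values) to 2 animation parameters.
    a, b, c = frame
    return [0.5 * a - 0.2 * b, 0.3 * b + 0.1 * c]

def pseudo_label(frames):
    # Step 1: run the teacher over unlabeled audio to create training targets.
    return [(f, teacher(f)) for f in frames]

def train_student(pairs, n_in=3, n_out=2, lr=0.1, epochs=200):
    # Step 2: fit a tiny linear student to the teacher's outputs
    # using plain SGD on the squared error.
    w = [[0.0] * n_in for _ in range(n_out)]
    for _ in range(epochs):
        for x, y in pairs:
            for j in range(n_out):
                pred = sum(w[j][i] * x[i] for i in range(n_in))
                err = pred - y[j]
                for i in range(n_in):
                    w[j][i] -= lr * err * x[i]
    return w

def predict(w, x):
    # Student inference: one dot product per output parameter.
    return [sum(wj[i] * x[i] for i in range(len(x))) for wj in w]

# Unlabeled "audio": random feature frames, no ground-truth animation needed.
rng = random.Random(0)
unlabeled_audio = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
student = train_student(pseudo_label(unlabeled_audio))
```

The point of the sketch is only that no audio-animation pairs are required: the targets come entirely from the teacher, so the size of the training set is limited by the available audio rather than by captured animation data.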
