Audio- and Gaze-driven Facial Animation of Codec Avatars
Alexander Richard, Colin Lea, Shugao Ma, Juergen Gall, Fernando de la Torre, Yaser Sheikh

TL;DR
This paper presents a real-time method for animating photorealistic Codec Avatars using audio and eye-tracking data, enabling expressive social interactions in virtual reality.
Contribution
It introduces the first real-time animation approach for Codec Avatars utilizing multimodal sensor fusion of audio and gaze signals.
Findings
Achieved expressive facial animations beyond neutral lip movements.
Collected over 5 hours of high frame rate 3D face scans across three participants, covering neutral, expressive, and conversational speech.
Runs in real time and could be deployed on commodity VR hardware using audio and/or eye tracking.
Abstract
Codec Avatars are a recent class of learned, photorealistic face models that accurately represent the geometry and texture of a person in 3D (i.e., for virtual reality), and are almost indistinguishable from video. In this paper we describe the first approach to animate these parametric models in real-time which could be deployed on commodity virtual reality hardware using audio and/or eye tracking. Our goal is to display expressive conversations between individuals that exhibit important social signals such as laughter and excitement solely from latent cues in our lossy input signals. To this end we collected over 5 hours of high frame rate 3D face scans across three participants including traditional neutral speech as well as expressive and conversational speech. We investigate a multimodal fusion approach that dynamically identifies which sensor encoding should animate which parts of…
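The abstract's idea of choosing, per face part, which sensor encoding drives the animation can be illustrated with a small gating sketch. This is not the authors' architecture: the softmax gate, the function names `softmax` and `fuse`, the region names `mouth` and `eyes`, and the hand-set gate scores are all assumptions made for illustration. In a learned system the scores would presumably come from a trained network rather than being fixed by hand.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_enc, gaze_enc, gate_scores):
    """Blend two per-region encodings with per-region softmax weights.

    audio_enc, gaze_enc: dict mapping region name -> list of floats
        (hypothetical encodings, same length for a given region).
    gate_scores: dict mapping region name -> (audio_score, gaze_score);
        a hand-set stand-in for the output of a learned gating network.
    """
    fused = {}
    for region, (a_score, g_score) in gate_scores.items():
        w_audio, w_gaze = softmax([a_score, g_score])
        fused[region] = [
            w_audio * a + w_gaze * g
            for a, g in zip(audio_enc[region], gaze_enc[region])
        ]
    return fused


# Toy inputs: audio carries the mouth signal, gaze carries the eye signal.
audio_enc = {"mouth": [1.0, 0.0], "eyes": [0.0, 0.0]}
gaze_enc = {"mouth": [0.0, 0.0], "eyes": [0.0, 1.0]}

# Assumed gate scores: favor audio for the mouth and gaze for the eyes.
gate_scores = {"mouth": (4.0, 0.0), "eyes": (0.0, 4.0)}

fused = fuse(audio_enc, gaze_enc, gate_scores)
```

With these toy scores the fused mouth encoding is dominated by the audio input and the fused eye encoding by the gaze input, which is the behavior a dynamic, per-region fusion is meant to allow.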
