TL;DR
JoyVASA introduces a diffusion-based framework for audio-driven animation of human and animal faces. It decouples static and dynamic facial features so that motion is generated independently of identity, enabling longer, high-quality, multilingual videos.
Contribution
The paper proposes a decoupled facial representation and a diffusion transformer that together extend animation to animal faces and improve video length and quality.
Findings
Effective decoupling of static and dynamic facial features.
Multilingual support with diverse datasets.
Seamless animation of animal faces alongside human portraits.
Abstract
Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lipsync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of…
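The decoupling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, purely for illustration, that the static representation is a list of 3D keypoints and that the motion sequence is a list of per-frame keypoint offsets. The names `apply_motion`, `human`, `animal`, and `motion` are hypothetical, and the diffusion transformer that would produce the motion from audio is not modeled.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def apply_motion(
    static_keypoints: List[Point],
    motion_sequence: List[List[Point]],
) -> List[List[Point]]:
    """Combine one static representation with a motion sequence.

    Hypothetical simplification: each frame's keypoints are the static
    keypoints plus that frame's offsets. The motion carries no identity
    information, so it can be reused with any static representation.
    """
    frames = []
    for offsets in motion_sequence:
        if len(offsets) != len(static_keypoints):
            raise ValueError("each frame needs one offset per keypoint")
        frame = [
            (sx + dx, sy + dy, sz + dz)
            for (sx, sy, sz), (dx, dy, dz) in zip(static_keypoints, offsets)
        ]
        frames.append(frame)
    return frames


# Two toy "identities" (assumed values) sharing one motion sequence.
human = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
animal = [(0.0, 2.0, 0.0), (3.0, 2.0, 0.0)]
motion = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],   # frame 0: neutral
    [(0.0, 0.5, 0.0), (0.0, -0.5, 0.0)],  # frame 1: keypoints move apart
]

human_frames = apply_motion(human, motion)
animal_frames = apply_motion(animal, motion)
```

Because the motion sequence is separate from the identity, the number of output frames depends only on the length of `motion_sequence`, which is one way to read the abstract's claim that decoupling allows longer videos.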
