TL;DR
JoyStreamer is a framework for avatar video generation that improves alignment with text and audio conditions, enabling complex, natural, and coherent full-body avatar motion with dynamic camera control.
Contribution
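The paper introduces a twin-teacher enhanced training algorithm, which transfers text controllability from the foundation model while the model learns audio-visual synchronization, and a scheme that dynamically modulates the strength of multi-modal conditions (e.g., audio and text) during training.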
Findings
Outperforms state-of-the-art models such as OmniHuman-1.5 and KlingAvatar 2.0.
Enables complex multi-person dialogues and non-human role-playing.
Produces natural, temporally coherent full-body motions with dynamic camera movements.
Abstract
Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, most of these methods align poorly with text instructions, particularly when the prompts involve complex elements such as large full-body movements, dynamic camera trajectories, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. First, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability from the foundation model while simultaneously learning audio-visual synchronization. Second, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on…
