JoyStreamer: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning

Ruikui Wang, Jinheng Feng, Lang Tian, Huaishao Luo, Chaochao Li, Liangbo Zhou, Huan Zhang, Youzheng Wu, and Xiaodong He

arXiv:2602.00702·cs.CV·April 1, 2026

TL;DR

JoyStreamer is a framework for avatar video generation that improves alignment with text and audio conditions, enabling complex, natural, and temporally coherent full-body motion with dynamic camera control.

Contribution

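The paper introduces two techniques: a twin-teacher enhanced training algorithm, which transfers the foundation model's text controllability while the model learns audio-visual synchronization, and dynamic modulation of the strength of multi-modal conditions (e.g., audio and text) during training.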

Findings

01 Outperforms state-of-the-art models such as OmniHuman-1.5 and KlingAvatar 2.0.

02 Enables complex multi-person dialogues and non-human role-playing.

03 Produces natural, temporally coherent full-body motions with dynamic camera movements.

Abstract

Existing video avatar models have demonstrated impressive capabilities in scenarios such as talking, public speaking, and singing. However, most of these methods align poorly with text instructions, particularly when the prompts involve complex elements such as large full-body movements, dynamic camera trajectories, background transitions, or human-object interactions. To overcome this limitation, we present JoyAvatar, a framework capable of generating long-duration avatar videos, featuring two key technical innovations. First, we introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text controllability from the foundation model while simultaneously learning audio-visual synchronization. Second, during training, we dynamically modulate the strength of multi-modal conditions (e.g., audio and text) based on…

Code & Models

Repositories

https://joystreamer.github.io
