JoyHallo: Digital human model for Mandarin
Sheng Shi, Xuyang Cao, Jun Zhao, Guoxin Wang

TL;DR
JoyHallo is a novel digital human model capable of generating Mandarin videos with improved efficiency and cross-language capabilities, developed using a new Mandarin speech dataset and a semi-decoupled feature integration approach.
Contribution
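The paper introduces the jdh-Hallo dataset of Mandarin speech video, adapts JoyHallo to Mandarin by using the Chinese wav2vec2 model for audio feature embedding, and proposes a semi-decoupled structure that integrates lip, expression, and pose features for more efficient information use and faster inference.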
Findings
Achieved 14.3% faster inference speed.
Successfully generated Mandarin videos with diverse speech styles.
Maintained strong English video generation capabilities.
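A note on reading the 14.3% figure: the summary does not say whether it refers to a reduction in inference time or a gain in throughput, and the two readings give slightly different numbers. The sketch below works through both under stated assumptions. The function name, the `reading` labels, and the 100-second baseline are hypothetical and chosen only for illustration; none of them come from the paper.

```python
def time_after_speedup(baseline_seconds: float, pct: float, reading: str) -> float:
    """Return the new inference time under one reading of 'pct% faster'."""
    frac = pct / 100.0
    if reading == "time_reduction":
        # Reading 1: the same job takes pct% less wall-clock time.
        return baseline_seconds * (1.0 - frac)
    if reading == "throughput_gain":
        # Reading 2: pct% more work finishes per unit of time.
        return baseline_seconds / (1.0 + frac)
    raise ValueError(f"unknown reading: {reading}")


baseline = 100.0  # hypothetical baseline in seconds, not a figure from the paper
for reading in ("time_reduction", "throughput_gain"):
    print(reading, round(time_after_speedup(baseline, 14.3, reading), 1))
```

For a job that took 100 seconds before, the first reading gives about 85.7 seconds and the second about 87.5 seconds, so the gap between the two interpretations is small at this magnitude.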
Abstract
In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its…
Taxonomy
Topics: Speech and dialogue systems · Context-Aware Activity Recognition Systems · Social Robot Interaction and HRI
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
