JoyHallo: Digital human model for Mandarin

Sheng Shi; Xuyang Cao; Jun Zhao; Guoxin Wang

arXiv:2409.13268·cs.CV·September 23, 2024


TL;DR

JoyHallo is an audio-driven digital human model that generates Mandarin videos, runs inference 14.3% faster, and retains English video generation. It is built on a new Mandarin speech video dataset (jdh-Hallo) and a semi-decoupled structure for feature integration.

Contribution

The paper introduces the jdh-Hallo dataset for Mandarin, adapts JoyHallo to Mandarin by using the Chinese wav2vec2 model for audio feature embedding, and proposes a semi-decoupled structure that captures relationships among lip, expression, and pose features and speeds up inference.

Findings

01. Achieved 14.3% faster inference.

02. Generated Mandarin videos across diverse speech styles.

03. Maintained strong English video generation capabilities.

Abstract

In audio-driven video generation, creating Mandarin videos presents significant challenges. Collecting comprehensive Mandarin datasets is difficult, and the complex lip movements in Mandarin further complicate model training compared to English. In this study, we collected 29 hours of Mandarin speech video from JD Health International Inc. employees, resulting in the jdh-Hallo dataset. This dataset includes a diverse range of ages and speaking styles, encompassing both conversational and specialized medical topics. To adapt the JoyHallo model for Mandarin, we employed the Chinese wav2vec2 model for audio feature embedding. A semi-decoupled structure is proposed to capture inter-feature relationships among lip, expression, and pose features. This integration not only improves information utilization efficiency but also accelerates inference speed by 14.3%. Notably, JoyHallo maintains its…
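The abstract's "accelerates inference speed by 14.3%" can be read in two ways, and the text shown here does not say which one is meant. The sketch below works through both readings for a hypothetical baseline of 100 seconds per clip. The baseline value, the function name `time_after_speedup`, and the reading labels are illustrative assumptions, not figures or terms from the paper.

```python
def time_after_speedup(baseline_seconds: float, pct: float, reading: str) -> float:
    """Return the new inference time under one reading of 'pct% faster'."""
    frac = pct / 100.0
    if reading == "less_time":
        # Reading A: wall-clock time shrinks by pct percent.
        return baseline_seconds * (1.0 - frac)
    if reading == "more_throughput":
        # Reading B: throughput (clips per second) grows by pct percent,
        # so the time per clip is divided by (1 + frac).
        return baseline_seconds / (1.0 + frac)
    raise ValueError(f"unknown reading: {reading!r}")


# Hypothetical baseline, chosen only to make the arithmetic easy to follow.
baseline = 100.0
time_a = time_after_speedup(baseline, 14.3, "less_time")
time_b = time_after_speedup(baseline, 14.3, "more_throughput")

print(f"Reading A (14.3% less time):       {time_a:.1f} s")  # 85.7 s
print(f"Reading B (14.3% more throughput): {time_b:.1f} s")  # 87.5 s
```

The two readings differ by under two seconds on this baseline, so the distinction mainly matters when comparing the figure against timings reported elsewhere.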


Code & Models

Models

jdh-algo/JoyHallo-v1 (model; 19 downloads, 11 likes)


Taxonomy

Topics: Speech and dialogue systems · Context-Aware Activity Recognition Systems · Social Robot Interaction and HRI

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings