JWB-DH-V1: Benchmark for Joint Whole-Body Talking Avatar and Speech Generation Version 1

Xinhan Di; Kristin Qi; Pengqian Yu

arXiv:2507.20987 · cs.CV · July 30, 2025

TL;DR

This paper introduces JWB-DH-V1, a comprehensive benchmark dataset and evaluation protocol for joint whole-body talking avatar and speech generation, addressing current challenges in multi-modal consistency and region-specific performance assessment.

Contribution

It provides a large-scale multi-modal dataset and evaluation framework specifically designed for assessing joint audio-visual generation of whole-body avatars.

Findings

01

State-of-the-art models show consistent performance disparities between face/hand-centric and whole-body generation.

02

Current models struggle with multi-modal consistency in whole-body synthesis.

03

The benchmark reveals key areas for future research in avatar generation.

Abstract

Recent advances in diffusion-based video generation have enabled photo-realistic short clips, but current methods still struggle to achieve multi-modal consistency when jointly generating whole-body motion and natural speech. Existing approaches lack comprehensive evaluation frameworks that assess both visual and audio quality, and there are insufficient benchmarks for region-specific performance analysis. To address these gaps, we introduce the Joint Whole-Body Talking Avatar and Speech Generation Version 1 (JWB-DH-V1), comprising a large-scale multi-modal dataset with 10,000 unique identities across 2 million video samples, and an evaluation protocol for assessing joint audio-video generation of whole-body animatable avatars. Our evaluation of SOTA models reveals consistent performance disparities between face/hand-centric and whole-body performance, which indicates essential areas for…

Taxonomy

Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI