TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis

Shunian Chen; Hejin Huang; Yexin Liu; Zihan Ye; Pengcheng Chen; Chenghao Zhu; Michael Guan; Rongsheng Wang; Junying Chen; Guanbin Li; Ser-Nam Lim; Harry Yang; Benyou Wang

arXiv:2508.13618 · cs.CV · August 20, 2025



TL;DR

TalkVid introduces a large, diverse, high-quality dataset for audio-driven talking head synthesis, significantly improving model generalization across ethnicity, language, and age groups, and highlighting subgroup performance disparities.

Contribution

The paper presents TalkVid, a large-scale, meticulously curated dataset that enhances diversity and quality for training talking head models, and introduces TalkVid-Bench for comprehensive evaluation.

Findings

01 Models trained on TalkVid outperform those trained on previous datasets.

02 TalkVid improves cross-dataset generalization.

03 Evaluation reveals performance disparities across demographic subgroups.

Abstract

Audio-driven talking head synthesis has achieved remarkable photorealism, yet state-of-the-art (SOTA) models exhibit a critical failure: they lack generalization to the full spectrum of human diversity in ethnicity, language, and age groups. We argue that this generalization gap is a direct symptom of limitations in existing training data, which lack the necessary scale, quality, and diversity. To address this challenge, we introduce TalkVid, a new large-scale, high-quality, and diverse dataset containing 1244 hours of video from 7729 unique speakers. TalkVid is curated through a principled, multi-stage automated pipeline that rigorously filters for motion stability, aesthetic quality, and facial detail, and is validated against human judgments to ensure its reliability. Furthermore, we construct and release TalkVid-Bench, a stratified evaluation set of 500 clips meticulously balanced…
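
The abstract names three filtering criteria (motion stability, aesthetic quality, facial detail) but this page does not say how they are scored or thresholded. As a rough illustration only, the sketch below assumes each clip already carries one score per criterion in [0, 1] and applies the criteria as sequential threshold filters. The `ClipScores` fields, the threshold values, and the `run_pipeline` helper are all hypothetical and are not taken from the TalkVid pipeline; the human-judgment validation step is not modeled.

```python
from dataclasses import dataclass


@dataclass
class ClipScores:
    """Hypothetical per-clip scores, each assumed to lie in [0, 1]."""
    clip_id: str
    motion_stability: float
    aesthetic_quality: float
    facial_detail: float


# Hypothetical stage order and minimum scores (not values from the paper).
STAGES = [
    ("motion_stability", 0.8),
    ("aesthetic_quality", 0.6),
    ("facial_detail", 0.7),
]


def run_pipeline(clips, stages=STAGES):
    """Apply each threshold filter in turn.

    Returns the surviving clips and the number of survivors after each stage.
    """
    survivors = list(clips)
    counts = []
    for name, minimum in stages:
        survivors = [c for c in survivors if getattr(c, name) >= minimum]
        counts.append((name, len(survivors)))
    return survivors, counts


clips = [
    ClipScores("a", 0.9, 0.7, 0.8),  # passes every stage
    ClipScores("b", 0.5, 0.9, 0.9),  # dropped at motion stability
    ClipScores("c", 0.9, 0.5, 0.9),  # dropped at aesthetic quality
    ClipScores("d", 0.9, 0.9, 0.3),  # dropped at facial detail
]

kept, counts = run_pipeline(clips)
print([c.clip_id for c in kept])
print(counts)
```

Recording the survivor count after each stage makes it easy to see which criterion discards the most data, which is useful when tuning thresholds against human ratings.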


Code & Models

Datasets

FreedomIntelligence/TalkVid
dataset · 1.4k downloads
