SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

Youliang Zhang; Zhaoyang Li; Duomin Wang; Jiahe Zhang; Deyu Zhou; Zixin Yin; Xili Dai; Gang Yu; Xiu Li

arXiv:2507.09862·cs.CV·July 15, 2025

Open Access 3 Reviews

TL;DR

SpeakerVid-5M is a large-scale dataset built to advance research in audio-visual dyadic virtual human generation. It covers diverse interaction types and provides high-quality data for training and benchmarking models.

Contribution

We introduce SpeakerVid-5M, the first large-scale, high-quality dataset for audio-visual dyadic virtual human generation, together with a baseline model and a benchmark for future research.

Findings

01

The dataset totals over 8,743 hours across more than 5.2 million video clips.

02

The dataset is structured along two dimensions: interaction type and data quality.

03

A baseline video chat model is trained on the dataset.
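
As a rough sanity check on the scale figures in the first finding, the stated totals imply an average clip length of roughly six seconds. The snippet below is a back-of-the-envelope sketch only: it treats "over 8,743 hours" and "more than 5.2 million clips" as exact values, which is an assumption, and the constant and function names are illustrative rather than taken from the paper.

```python
# Back-of-the-envelope estimate of the average clip length.
# Assumption: both totals are lower bounds in the source, but are
# treated as exact values here.
TOTAL_HOURS = 8_743
TOTAL_CLIPS = 5_200_000


def average_clip_seconds(total_hours: float, total_clips: int) -> float:
    """Return the mean clip duration in seconds implied by the totals."""
    return total_hours * 3600 / total_clips


avg = average_clip_seconds(TOTAL_HOURS, TOTAL_CLIPS)
print(f"Average clip length: {avg:.2f} s")
```

The actual duration distribution may differ considerably from this mean; the source text here reports only the aggregate totals.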

Abstract

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present SpeakerVid-5M, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch,…

Peer Reviews

Decision·ICLR 2026 Conditional Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

Dyadic human videos are an important mode of human-centric video generation. Having a large dataset with paired ASR, audio, and video is highly valuable. The paper is easy to follow, and the authors provide an extensive ethics statement. VidChatBench is a reasonable benchmark suite for the proposed dataset.

Weaknesses

It seems that the dataset only contains pairs of videos (initiator -> responder). However, to future-proof this, I wonder if the authors could also make available extended back-and-forth sequences of their data, i.e. where the initiator and responder exchange multiple turns. The authors claim that the video resolution is 1080P; however, the sample videos in the supplementary material are crops that are significantly smaller than 1080P. Could the authors clarify if they will release

Reviewer 02·Rating 8·Confidence 4

Strengths

1. The paper addresses a critical and timely problem. The research community's focus is clearly shifting from passive, audio-driven "talking heads" to proactive, interactive digital humans. The authors correctly identify that the single greatest barrier to open-source academic research in this area is the lack of a large-scale, high-quality dataset specifically capturing dyadic audio-visual interactions. This contribution directly unblocks this important future direction.
2. The dataset is a sig

Weaknesses

1. Limited Data Domain and Generality: The data sources, while high-quality, are heavily skewed towards formal or semi-formal scenarios (interviews, news, seminars, debates). The dataset appears to lack more casual, "in-the-wild" interaction styles, such as personal vlogs, movie/TV drama scenes, or general user-generated content. This "domain bias" might limit the ability of models trained on this data to generalize to the full spectrum of human interaction needed for a truly "general purpose" d

Reviewer 03·Rating 6·Confidence 2

Strengths

1. SpeakerVid-5M is one of the largest open-source datasets in the speaker understanding domain, with diverse speakers, recording conditions, and languages, making it valuable for pretraining multimodal encoders.
2. By integrating speaker verification, lip-reading, and voice–face retrieval within one dataset, it provides a practical testbed for representation transfer and multi-task learning.
3. The use of automatic speech-face alignment and quality filtering (lip-sync thresholding, blur detec

Weaknesses

1. The core contribution is the dataset’s size, not a new method, model, or theoretical insight. It extends existing pipelines rather than rethinking them.
2. There is little quantitative characterization of dataset diversity, no breakdown by gender, age, language, recording environment, or cultural representation.
3. Large-scale scraping of online videos raises questions about consent, copyright, and potential bias amplification. The ethical section is superficial and lacks compliance discuss

Taxonomy

Topics

Speech and Audio Processing · Music and Audio Processing · Human Motion and Animation