TL;DR
This paper introduces MIT, a large-scale multi-human talking video dataset with annotations, and proposes CovOG, a baseline model for generating realistic multi-person talking videos, addressing a gap in existing single-person focused research.
Contribution
The paper presents a new multi-human talking video dataset and a baseline model that handles multiple speakers and interactions, advancing research in multi-person conversational video generation.
Findings
The MIT dataset contains 12 hours of high-resolution multi-speaker videos.
CovOG demonstrates the feasibility of multi-human talking video synthesis.
The dataset and model establish a benchmark for future multi-human interaction studies.
Abstract
Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to…
Peer Reviews
Decision·Submitted to ICLR 2026
The community really needs a high-quality, large-scale multi-human interactive talking dataset in which the gestures, expressions, and other dynamic behaviors of the human subjects involved in the conversation MUST BE TEMPORALLY AND PHYSICALLY CONSISTENT. However, the results shown in the supplementary video lack all of these attributes.
The video results shown in the supplementary material are not temporally and physically consistent. For example: the number of fingers and their shapes change across frames; a flat palm frequently morphs into a fist suddenly; objects held in the hand change across frames; and facial expressions are not consistent across frames.
The problem of multi-human interaction is highly relevant and challenging. Providing high-quality datasets for this task is crucial. The proposed baseline is simple yet effective.
The authors note that “How to effectively evaluate lip synchronization in such interactive contexts remains an open problem” (L353-354) and instead opt for a user study. While a user study is useful, I would argue that for a dataset an accompanying benchmark is crucial. My concern here is that the dataset might be released prematurely without a good benchmark that measures how well the audio and video align, i.e. how well the speaker (lip movement, correct person) and the audio are in agreement. I wonder if s
1. This work presents the first dataset specifically designed for multi-human talking video generation, addressing the scarcity of multi-human interactive data. 2. It develops an automatic data collection pipeline and constructs the first dataset for multi-human talking video generation, featuring annotations of pose and speech interaction. 3. A baseline model is proposed for this task, which supports a flexible number of human speakers and captures the dynamics of speech interactions. We further
1. Although this paper is the first to propose a dataset for multi-human talking video generation, several recent works have also addressed this task, including: [1] HunyuanVideo-Avatar: High-Fidelity Audio-Driven Human Animation for Multiple Characters [2] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation [3] Bind-Your-Avatar: Multi-Talking-Character Video Generation with Dynamic 3D-mask-based Embedding Router 2. Current video generation models typically require a large
