TL;DR
This paper introduces a new 70-hour video dataset of two-person interactions, enabling modeling of conversational dynamics and reactive behaviors in multi-person settings.
Contribution
The paper presents a novel dataset, a semi-automatic extraction pipeline, and baseline experiments for modeling dyadic, sequential human interactions.
Findings
Cross-person visual context improves emotion and FVD metrics.
The dataset enables reactive, speech-driven avatar generation.
Baseline models preserve lip-sync quality with visual context conditioning.
Abstract
Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video is generated from the host's audio plus the guest's preceding video. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small…
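To make the "guest turn, then host response" dependency concrete, here is a minimal Python sketch of how diarized speaker turns could be paired into training samples. It is an illustration under stated assumptions, not the authors' pipeline: the `Turn` record, the `"host"`/`"guest"` labels, the `pair_guest_host_turns` helper, and the `max_gap` threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """One diarized speech segment (hypothetical record, not the dataset's schema)."""
    speaker: str  # assumed labels: "host" or "guest"
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds


def pair_guest_host_turns(
    turns: List[Turn], max_gap: float = 1.0
) -> List[Tuple[Turn, Turn]]:
    """Return (guest_turn, host_response) pairs.

    A pair is kept when a host turn is the very next turn after a guest turn
    and begins no later than `max_gap` seconds after the guest turn ends.
    Overlapping speech (a negative gap) is accepted as well.
    """
    ordered = sorted(turns, key=lambda t: t.start)
    pairs = []
    for prev, nxt in zip(ordered, ordered[1:]):
        is_exchange = prev.speaker == "guest" and nxt.speaker == "host"
        if is_exchange and (nxt.start - prev.end) <= max_gap:
            pairs.append((prev, nxt))
    return pairs


# Toy timeline: only the first guest -> host exchange falls within the gap limit.
example_turns = [
    Turn("guest", 0.0, 4.2),
    Turn("host", 4.5, 7.0),
    Turn("host", 7.1, 9.0),
    Turn("guest", 12.0, 15.0),
    Turn("host", 18.0, 20.0),
]
example_pairs = pair_guest_host_turns(example_turns)
```

In a setup like the one the abstract describes, each pair would supply the guest's video from the first turn as visual context and the host's audio from the second turn as the driving signal. The gap threshold here is arbitrary and would need tuning against real diarization output.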
