Face-to-Face: A Video Dataset for Multi-Person Interaction Modeling

Ernie Chu; Vishal M. Patel

arXiv:2603.14794·cs.CV·April 1, 2026

TL;DR

This paper introduces a new 70-hour video dataset of two-person interactions, enabling modeling of conversational dynamics and reactive behaviors in multi-person settings.

Contribution

The paper presents a novel dataset, a semi-automatic extraction pipeline, and baseline experiments for modeling dyadic, sequential human interactions.

Findings

01. Cross-person visual context improves emotion and FVD metrics.

02. The dataset enables reactive, speech-driven avatar generation.

03. Baseline models preserve lip-sync quality with visual context conditioning.

Abstract

Modeling the reactive tempo of human conversation remains difficult because most audio-visual datasets portray isolated speakers delivering short monologues. We introduce Face-to-Face with Jimmy Fallon (F2F-JF), a 70-hour, 14k-clip dataset of two-person talk-show exchanges that preserves the sequential dependency between a guest turn and the host's response. A semi-automatic pipeline combines multi-person tracking, speech diarization, and lightweight human verification to extract temporally aligned host/guest tracks with tight crops and metadata that are ready for downstream modeling. We showcase the dataset with a reactive, speech-driven digital avatar task in which the host video during $[t_1, t_2]$ is generated from their audio plus the guest's preceding video during $[t_0, t_1]$. Conditioning a MultiTalk-style diffusion model on this cross-person visual context yields small…
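
The windowing in the avatar task can be made concrete with a minimal sketch. It only illustrates how the three time stamps split a clip into the guest's visual context and the host's target segment. The names `ReactiveSample` and `make_sample`, the 25 fps frame rate, the 16 kHz audio rate, and the half-open interval convention are all assumptions for illustration, not details taken from the paper or its released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactiveSample:
    """Index ranges for one guest-turn / host-response pair (hypothetical layout)."""
    guest_frames: range  # guest video context, frames covering [t0, t1)
    host_frames: range   # host video to generate, frames covering [t1, t2)
    host_audio: range    # host audio samples driving generation, [t1, t2)


def make_sample(t0: float, t1: float, t2: float,
                fps: int = 25, sample_rate: int = 16000) -> ReactiveSample:
    """Convert time stamps in seconds to frame and audio-sample index ranges.

    fps and sample_rate are assumed defaults, not values from the dataset.
    """
    if not (t0 < t1 < t2):
        raise ValueError("expected t0 < t1 < t2")
    f0, f1, f2 = (round(t * fps) for t in (t0, t1, t2))
    a1, a2 = round(t1 * sample_rate), round(t2 * sample_rate)
    # Half-open ranges keep the guest context and host target from overlapping.
    return ReactiveSample(range(f0, f1), range(f1, f2), range(a1, a2))


# Guest speaks from 2 s to 6 s; host responds from 6 s to 9 s.
sample = make_sample(t0=2.0, t1=6.0, t2=9.0)
```

Under these assumptions, the guest context ends exactly where the host target begins, so a model conditioned on `guest_frames` sees only what preceded the response it is asked to generate.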

Code & Models

Repositories

https://face2face2026.github.io