Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Vasu Agrawal; Akinniyi Akinyemi; Kathryn Alvero; Morteza Behrooz; Julia Buffalini; Fabio Maria Carlucci; Joy Chen; Junming Chen; Zhang Chen; Shiyang Cheng; Praveen Chowdary; Joe Chuang; Antony D'Avirro; Jon Daly; Ning Dong; Mark Duppenthaler; Cynthia Gao; Jeff Girard; Martin Gleize; Sahir Gomez; Hongyu Gong; Srivathsan Govindarajan; Brandon Han; Sen He; Denise Hernandez; Yordan Hristov; Rongjie Huang; Hirofumi Inaguma; Somya Jain; Raj Janardhan; Qingyao Jia; Christopher Klaiber; Dejan Kovachev; Moneish Kumar; Hang Li; Yilei Li; Pavel Litvin; Wei Liu; Guangyao Ma; Jing Ma; Martin Ma; Xutai Ma; Lucas Mantovani; Sagar Miglani; Sreyas Mohan; Louis-Philippe Morency; Evonne Ng; Kam-Woh Ng; Tu Anh Nguyen; Amia Oberai; Benjamin Peloquin; Juan Pino; Jovan Popovic; Omid Poursaeed; Fabian Prada; Alice Rakotoarison; Rakesh Ranjan; Alexander Richard; Christophe Ropers; Safiyyah Saleem; Vasu Sharma; Alex Shcherbyna; Jia Shen; Jie Shen; Anastasis Stathopoulos; Anna Sun; Paden Tomasello; Tuan Tran; Arina Turkatenko; Bo Wan; Chao Wang; Jeff Wang; Mary Williamson; Carleigh Wood; Tao Xiang; Yilin Yang; Julien Yao; Chen Zhang; Jiemin Zhang; Xinyue Zhang; Jason Zheng; Pavlo Zhyzheria; Jan Zikes; Michael Zollhoefer

arXiv:2506.22554·cs.CV·July 2, 2025

TL;DR

This paper introduces a large-scale audiovisual dataset and models for understanding and generating dyadic human interactions, advancing socially intelligent AI with applications in virtual agents and multimodal communication.

Contribution

The paper provides a new large-scale dataset and develops models that generate dyadic motion and facial expressions aligned with speech, including controllable variants and variants that take both speech and visual input.

Findings

1. The dataset contains over 4,000 hours of interaction footage from over 4,000 participants.

2. Models can generate synchronized gestures and facial expressions based on speech and visual cues.

3. Controllable models can adapt emotional responses and expressivity levels.

Abstract

Human communication involves a complex interplay of verbal and nonverbal signals, essential for conveying meaning and achieving interpersonal goals. To develop socially intelligent AI technologies, it is crucial to develop models that can both comprehend and generate dyadic behavioral dynamics. To this end, we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools. We also develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech. These models can take as input both the speech and visual…
