InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance

Dongwei Pan; Longwei Guo; Jiazhi Guan; Luying Huang; Yiding Li; Haojie Liu; Haocheng Feng; Wei He; Kaisiyuan Wang; Hang Zhou

arXiv:2603.23132 · cs.CV · March 25, 2026

Open Access

TL;DR

InterDyad is a novel framework for interactive dyadic speech-to-video generation that combines motion guidance, linguistic intent extraction, and lip-sync enhancement to produce natural two-person interactions with improved control and realism.

Contribution

We introduce InterDyad, a comprehensive framework integrating motion guidance, modality alignment, and lip-sync improvements for dyadic speech-to-video synthesis.

Findings

01 Outperforms state-of-the-art methods in naturalness and contextual accuracy

02 Improves lip-sync quality under extreme head poses

03 Provides a new evaluation suite with novel metrics

Abstract

Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face Recognition and Analysis