InterDyad: Interactive Dyadic Speech-to-Video Generation by Querying Intermediate Visual Guidance
Dongwei Pan, Longwei Guo, Jiazhi Guan, Luying Huang, Yiding Li, Haojie Liu, Haocheng Feng, Wei He, Kaisiyuan Wang, Hang Zhou

TL;DR
InterDyad is a novel framework for interactive dyadic speech-to-video generation that combines motion guidance, linguistic intent extraction, and lip-sync enhancement to produce natural two-person interactions with improved control and realism.
Contribution
We introduce InterDyad, a comprehensive framework integrating motion guidance, modality alignment, and lip-sync improvements for dyadic speech-to-video synthesis.
Findings
Outperforms state-of-the-art methods in naturalness and contextual accuracy
Enhances lip-sync quality under extreme head poses
Provides a new evaluation suite with novel metrics
Abstract
Despite progress in speech-to-video synthesis, existing methods often struggle to capture cross-individual dependencies and provide fine-grained control over reactive behaviors in dyadic settings. To address these challenges, we propose InterDyad, a framework that enables naturalistic interactive dynamics synthesis via querying structural motion guidance. Specifically, we first design an Interactivity Injector that achieves video reenactment based on identity-agnostic motion priors extracted from reference videos. Building upon this, we introduce a MetaQuery-based modality alignment mechanism to bridge the gap between conversational audio and these motion priors. By leveraging a Multimodal Large Language Model (MLLM), our framework is able to distill linguistic intent from audio to dictate the precise timing and appropriateness of reactions. To further improve lip-sync quality under…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing · Face Recognition and Analysis
