It Takes Two: Real-time Co-Speech Two-person's Interaction Generation via Reactive Auto-regressive Diffusion Model

Mingyi Shi; Dafei Qin; Leo Ho; Zhouyingcheng Liao; Yinghao Huang; Junichi Yamagishi; Taku Komura

arXiv:2412.02419 · cs.SD · December 4, 2024

Open Access

TL;DR

This paper presents a novel real-time, auto-regressive diffusion model for generating interactive two-person co-speech motions, enabling online, dynamic, full-body character interactions driven by speech audio.

Contribution

It introduces the first online system for two-person co-speech motion synthesis using a diffusion-based model conditioned on speech and past states.

Findings

01. Outperforms existing methods in co-speech motion generation tasks

02. Successfully generates interactive full-body motions in real-time

03. Enriched datasets improve interaction diversity

Abstract

Conversational scenarios are very common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person's audio and gestures will influence the other's responses. Additionally, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model, which is conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model's ability to learn diverse interactions, we have enriched existing two-person conversational motion datasets with more dynamic…
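
The conditioning scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser is a random linear map standing in for a learned network, the sampler is a simplified predict-then-re-noise loop, and every name here (`toy_denoiser`, `sample_next_pose`, `rollout`, the feature dimensions) is an assumption made for illustration. What it shows is the control flow only: each frame is generated online from the past poses of both characters, the current speech features, and a trajectory input, and the result is fed back as the next frame's past state.

```python
import numpy as np


def toy_denoiser(x_t, t, cond, weights):
    """Hypothetical stand-in for a learned network: predicts a clean pose."""
    inp = np.concatenate([x_t, cond, [t]])
    return np.tanh(weights @ inp)


def sample_next_pose(cond, weights, pose_dim, num_steps, rng):
    """Simplified reverse diffusion for one frame (illustrative, not DDPM-exact)."""
    x = rng.standard_normal(pose_dim)  # start from pure noise
    for step in range(num_steps, 0, -1):
        t = step / num_steps
        x0_hat = toy_denoiser(x, t, cond, weights)
        t_prev = (step - 1) / num_steps
        # Re-noise the clean estimate to the next, lower noise level.
        x = x0_hat + t_prev * rng.standard_normal(pose_dim)
    return x  # the last step adds no noise, so this is the final estimate


def rollout(audio_a, audio_b, trajectory, pose_dim=6, num_steps=4, seed=0):
    """Auto-regressive two-character rollout.

    audio_a, audio_b: (frames, audio_dim) per-frame speech features (assumed).
    trajectory: (frames, 2, traj_dim) per-character control input (assumed).
    Returns motion of shape (frames, 2, pose_dim).
    """
    rng = np.random.default_rng(seed)
    frames, audio_dim = audio_a.shape
    traj_dim = trajectory.shape[2]
    cond_dim = 2 * pose_dim + 2 * audio_dim + traj_dim
    # Random weights replace a trained model; shared by both characters.
    weights = 0.1 * rng.standard_normal((pose_dim, pose_dim + cond_dim + 1))

    pose_a = np.zeros(pose_dim)
    pose_b = np.zeros(pose_dim)
    motion = np.zeros((frames, 2, pose_dim))
    for f in range(frames):
        # Each character sees its own past pose, the partner's past pose,
        # both speech streams at this frame, and its own trajectory input.
        cond_a = np.concatenate(
            [pose_a, pose_b, audio_a[f], audio_b[f], trajectory[f, 0]])
        cond_b = np.concatenate(
            [pose_b, pose_a, audio_b[f], audio_a[f], trajectory[f, 1]])
        next_a = sample_next_pose(cond_a, weights, pose_dim, num_steps, rng)
        next_b = sample_next_pose(cond_b, weights, pose_dim, num_steps, rng)
        pose_a, pose_b = next_a, next_b  # feed back as the next past state
        motion[f, 0], motion[f, 1] = pose_a, pose_b
    return motion
```

Because frame `f` only reads inputs up to frame `f`, the loop can consume audio as it arrives, which is the property that separates an online system from an offline sequence-to-sequence one. A real system would replace the random weights with a trained network and would likely condition on a window of past frames rather than a single one.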

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis