Talking Together: Synthesizing Co-Located 3D Conversations from Audio
Mengyi Shan, Shouchieh Chang, Ziqian Bai, Shichen Liu, Yinda Zhang, Luchuan Song, Rohit Pandey, Sean Fanello, Zeng Huang

TL;DR
This paper presents a novel system for generating realistic, spatially-aware 3D facial animations of two interacting people from audio, capturing their dynamic positions, gaze, and lip-sync for immersive VR and telepresence.
Contribution
It introduces a dual-stream architecture with cross-attention and a new eye gaze loss, along with a large-scale conversational dataset, enabling realistic and controllable 3D dyadic animations from audio.
Findings
Outperforms existing baselines in realism and interaction coherence.
Generates fluid, controllable, spatially-aware dyadic animations.
Successfully models mutual gaze and head pose dynamics.
Abstract
We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsFace recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis
