Talking Together: Synthesizing Co-Located 3D Conversations from Audio

Mengyi Shan; Shouchieh Chang; Ziqian Bai; Shichen Liu; Yinda Zhang; Luchuan Song; Rohit Pandey; Sean Fanello; Zeng Huang

arXiv:2603.08674 · cs.CV · March 10, 2026

Open Access

TL;DR

This paper presents a novel system for generating realistic, spatially-aware 3D facial animations of two interacting people from audio, capturing their dynamic positions, gaze, and lip-sync for immersive VR and telepresence.

Contribution

It introduces a dual-stream architecture with cross-attention and a new eye gaze loss, along with a large-scale conversational dataset, enabling realistic and controllable 3D dyadic animations from audio.

Findings

01. Outperforms existing baselines in realism and interaction coherence.

02. Generates fluid, controllable, spatially-aware dyadic animations.

03. Successfully models mutual gaze and head pose dynamics.

Abstract

We tackle the challenging task of generating complete 3D facial animations for two interacting, co-located participants from a mixed audio stream. While existing methods often produce disembodied "talking heads" akin to a video conference call, our work is the first to explicitly model the dynamic 3D spatial relationship -- including relative position, orientation, and mutual gaze -- that is crucial for realistic in-person dialogues. Our system synthesizes the full performance of both individuals, including precise lip-sync, and uniquely allows their relative head poses to be controlled via textual descriptions. To achieve this, we propose a dual-stream architecture where each stream is responsible for one participant's output. We employ speaker's role embeddings and inter-speaker cross-attention mechanisms designed to disentangle the mixed audio and model the interaction. Furthermore,…
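
The dual-stream idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, additive role embeddings, and a residual connection, and every name in it (`cross_attend`, `add_role`, `dual_stream_step`) is hypothetical. It only shows how two streams that receive the same mixed audio features can be told apart by a role embedding and can exchange information through inter-speaker cross-attention.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention (assumed form, no learned weights).

    Each query frame attends over all frames of the other stream, which
    serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def add_role(features, role_embedding):
    """Add a per-participant role embedding to every frame (assumption)."""
    return [[f + r for f, r in zip(frame, role_embedding)] for frame in features]


def dual_stream_step(audio_features, role_a, role_b):
    """One hypothetical interaction step between the two streams."""
    # Both streams see the same mixed audio; only the role embedding differs.
    stream_a = add_role(audio_features, role_a)
    stream_b = add_role(audio_features, role_b)
    # Inter-speaker cross-attention: each stream queries the other one.
    ctx_a = cross_attend(stream_a, stream_b)
    ctx_b = cross_attend(stream_b, stream_a)
    # Residual connection keeps each stream's own features.
    out_a = [[x + c for x, c in zip(f, g)] for f, g in zip(stream_a, ctx_a)]
    out_b = [[x + c for x, c in zip(f, g)] for f, g in zip(stream_b, ctx_b)]
    return out_a, out_b


# Toy usage: 4 audio frames with 8-dimensional features (random stand-ins).
rng = random.Random(0)
T, D = 4, 8
audio = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
role_a = [rng.gauss(0, 1) for _ in range(D)]
role_b = [rng.gauss(0, 1) for _ in range(D)]
out_a, out_b = dual_stream_step(audio, role_a, role_b)
print(len(out_a), len(out_a[0]))
```

In this sketch the two outputs differ only because of the role embeddings, so swapping `role_a` and `role_b` swaps the outputs. In a trained model, each stream's output would be decoded into one participant's facial animation parameters; that decoding step is omitted here.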

Taxonomy

Topics: Face recognition and analysis · Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis