Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Junjie Chen; Fei Wang; Zhihao Huang; Qing Zhou; Kun Li; Dan Guo; Linfeng Zhang; Xun Yang

arXiv:2512.15340 · cs.CV · February 27, 2026

TL;DR

This paper introduces TIMAR, a causal modeling framework for 3D conversational head dynamics that effectively captures bidirectional multimodal interactions, improving temporal coherence and expressiveness in animated avatars.

Contribution

TIMAR is the first turn-level causal framework that models interleaved audio-visual dialogue dynamics for 3D head generation, enhancing temporal coherence and expressive variability.

Findings

01. Reduces Fréchet Distance and MSE by 15-30% on the DualTalk benchmark.

02. Achieves similar performance gains on out-of-distribution data.

03. Source code is publicly available.

Abstract

Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability.…
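The turn-level causal attention described in the abstract can be pictured as a block-causal mask. The sketch below is only an illustration and is not taken from the TIMAR source code: the function name `turn_level_causal_mask`, the `turn_lengths` argument, and the choice of full (bidirectional) attention among tokens of the same turn are all assumptions made for this example.

```python
import numpy as np

def turn_level_causal_mask(turn_lengths):
    """Build a boolean (T, T) attention mask from per-turn token counts.

    mask[i, j] is True when token i is allowed to attend to token j.
    Assumption (not confirmed by the abstract): tokens attend freely
    within their own turn and to every earlier turn, never to later turns.
    """
    # Label each token with the index of the turn it belongs to.
    turn_ids = np.repeat(np.arange(len(turn_lengths)), turn_lengths)
    # Token i may see token j when j's turn is not later than i's turn.
    return turn_ids[:, None] >= turn_ids[None, :]

# Hypothetical dialogue with three turns of 2, 3, and 1 tokens.
mask = turn_level_causal_mask([2, 3, 1])
print(mask.astype(int))
```

Under these assumptions the mask is block lower-triangular: history accumulates turn by turn, while tokens inside a turn remain mutually visible so that audio and visual information from the same turn can be fused.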

Taxonomy

Topics: Face recognition and analysis · Emotion and Mood Recognition · Social Robot Interaction and HRI