Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang

TL;DR
This paper introduces TIMAR, a causal modeling framework for 3D conversational head dynamics that effectively captures bidirectional multimodal interactions, improving temporal coherence and expressiveness in animated avatars.
Contribution
TIMAR is the first turn-level causal framework that models interleaved audio-visual dialogue dynamics for 3D head generation, enhancing temporal coherence and expressive variability.
Findings
Reduces Fréchet Distance and MSE by 15-30% on the DualTalk benchmark.
Achieves similar performance gains on out-of-distribution data.
Source code is publicly available.
Abstract
Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability.…
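The turn-level causal attention described in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: it assumes that tokens are grouped into consecutive turns, that tokens within a turn attend to each other freely, and that each turn can also see all earlier turns but no later ones. The function name `turn_level_causal_mask` and the example turn lengths are hypothetical.

```python
def turn_level_causal_mask(turn_lengths):
    """Build a block-causal attention mask over a sequence of turns.

    turn_lengths: number of tokens in each turn (hypothetical layout).
    Returns a square list of lists where mask[q][k] is True when
    query token q is allowed to attend to key token k.
    """
    # Assign every token the index of the turn it belongs to.
    turn_ids = [t for t, n in enumerate(turn_lengths) for _ in range(n)]
    # A query may attend to keys in its own turn (bidirectional inside
    # the turn) and in any earlier turn (causal across turns).
    return [[q >= k for k in turn_ids] for q in turn_ids]


# Three turns of 2, 3, and 1 tokens (illustrative values only).
mask = turn_level_causal_mask([2, 3, 1])
for row in mask:
    print(" ".join("1" if allowed else "0" for allowed in row))
```

Each printed row shows which positions one token may attend to. Tokens in the first turn see only that turn, while the single token in the last turn sees the whole history. In a real model a mask of this kind would typically be passed to the attention layers as a boolean or additive mask.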
Taxonomy
Topics: Face recognition and analysis · Emotion and Mood Recognition · Social Robot Interaction and HRI
