MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation
Xinyan Ye, Jiankang Deng, Abbas Edalat

TL;DR
MoCoTalk is a multi-conditional video diffusion framework for controllable talking head generation. It fuses four control signals (a reference image, facial keypoints, 3DMM-rendered shading meshes, and speech audio) through an adaptive routing mechanism, aiming for improved realism and attribute control.
Contribution
MoCoTalk introduces an Adaptive Multi-Condition Router for fusing multiple conditions and a Mouth-Augmented Shading Mesh for modeling speech-related facial dynamics.
Findings
Achieves state-of-the-art performance on multiple metrics.
Provides attribute-level controllability beyond single-condition methods.
Demonstrates effective fusion of heterogeneous control signals.
Abstract
Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples…
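To make the routing idea concrete, here is a minimal NumPy sketch of channel-wise, timestep-aware gating over four condition streams. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that the gates are channel-wise and depend on the timestep. The sinusoidal timestep embedding, the linear gate projections, the softmax normalization across streams, and every name here (`timestep_embedding`, `route_conditions`, `w_feat`, `w_time`, `bias`) are hypothetical choices made for the example.

```python
import numpy as np


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar diffusion timestep (assumed choice)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = float(t) * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def route_conditions(streams, t, w_feat, w_time, bias):
    """Fuse K condition streams with per-channel, timestep-dependent gates.

    streams: (K, C, N) features for K conditions, C channels, N positions
    t:       scalar diffusion timestep
    w_feat:  (K, C, C) hypothetical projection of pooled stream features
    w_time:  (K, C, D) hypothetical projection of the timestep embedding
    bias:    (K, C)    hypothetical gate bias
    Returns the fused (C, N) features and the (K, C) gates.
    """
    pooled = streams.mean(axis=2)                      # (K, C) summary per stream
    emb = timestep_embedding(t, w_time.shape[2])       # (D,)
    logits = np.einsum("kij,kj->ki", w_feat, pooled) + w_time @ emb + bias
    # Softmax across the K streams, independently for each channel (assumption).
    logits = logits - logits.max(axis=0, keepdims=True)
    gates = np.exp(logits)
    gates = gates / gates.sum(axis=0, keepdims=True)   # (K, C), sums to 1 over K
    fused = (gates[:, :, None] * streams).sum(axis=0)  # (C, N)
    return fused, gates


# Toy usage: four streams standing in for reference image, keypoints,
# shading mesh, and audio features (random placeholders, not real data).
rng = np.random.default_rng(0)
K, C, N, D = 4, 8, 16, 32
streams = rng.standard_normal((K, C, N))
w_feat = 0.1 * rng.standard_normal((K, C, C))
w_time = 0.1 * rng.standard_normal((K, C, D))
bias = np.zeros((K, C))

fused_early, gates_early = route_conditions(streams, 10, w_feat, w_time, bias)
fused_late, gates_late = route_conditions(streams, 900, w_feat, w_time, bias)
```

Because the gate logits depend on both the pooled stream features and the timestep embedding, the same inputs receive different per-channel weights at `t = 10` and `t = 900`, which is the behavior the abstract describes as varying with feature subspace and noise level.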
