MoCoTalk: Multi-Conditional Diffusion with Adaptive Router for Controllable Talking Head Generation

Xinyan Ye; Jiankang Deng; Abbas Edalat

arXiv:2605.08050 · cs.CV · May 11, 2026

TL;DR

MoCoTalk is a multi-conditional video diffusion framework for controllable talking-head generation. It combines four control signals (a reference image, facial keypoints, 3DMM-rendered shading meshes, and speech audio) through an adaptive routing mechanism, aiming at improved realism and attribute-level control.

Contribution

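The paper introduces two components: an Adaptive Multi-Condition Router, which applies channel-wise, timestep-aware gating to fuse the condition streams, and a Mouth-Augmented Shading Mesh, a 3DMM-based representation designed to better capture speech-related facial dynamics.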

Findings

01 Achieves state-of-the-art performance on multiple metrics.

02 Provides attribute-level controllability beyond single-condition methods.

03 Demonstrates effective fusion of heterogeneous control signals.

Abstract

Talking-head generation requires joint modeling of identity, head pose, facial expression, and mouth dynamics. Existing methods typically address only a subset of these factors, and rely on fixed-weight or heuristic fusion when multiple conditions are involved. We present MoCoTalk, a multi-conditional video diffusion framework that unifies four complementary control signals: a reference image, facial keypoints, 3DMM-rendered shading meshes, and the corresponding speech audio. To resolve destructive interference among heterogeneous conditions, we introduce an Adaptive Multi-Condition Router that computes channel-wise, timestep-aware gating over the four condition streams, allowing the fusion strategy to vary with both feature subspace and noise level. To better capture speech-related facial dynamics, we design a Mouth-Augmented Shading Mesh, a 3DMM-based representation that decouples…
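To make the routing idea in the abstract concrete, the following is a minimal NumPy sketch of channel-wise, timestep-aware gating over four condition streams. It is an illustration under stated assumptions, not the authors' implementation: the function names (`timestep_embedding`, `route_conditions`), the sinusoidal timestep embedding, the single linear layer producing gate logits, and the softmax normalization across streams are all hypothetical choices. In particular, the gates here depend only on the timestep, whereas the actual router may also condition on the features themselves.

```python
import numpy as np


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (assumed; dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])


def route_conditions(streams, t, weight, bias):
    """Fuse condition streams with per-channel gates that vary with the timestep.

    streams: array of shape (num_conditions, channels, height, width)
    weight:  array of shape (num_conditions * channels, emb_dim)  (hypothetical)
    bias:    array of shape (num_conditions * channels,)          (hypothetical)
    Returns the fused features (channels, height, width) and the gates
    (num_conditions, channels).
    """
    num_cond, channels = streams.shape[:2]
    emb = timestep_embedding(t, weight.shape[1])

    # One logit per (condition, channel) pair, driven by the timestep embedding.
    logits = (weight @ emb + bias).reshape(num_cond, channels)

    # Softmax across conditions, so each channel's gates sum to 1.
    logits = logits - logits.max(axis=0, keepdims=True)
    gates = np.exp(logits)
    gates = gates / gates.sum(axis=0, keepdims=True)

    # Broadcast the gates over the spatial dimensions and sum the streams.
    expanded = gates.reshape(num_cond, channels, *([1] * (streams.ndim - 2)))
    fused = (expanded * streams).sum(axis=0)
    return fused, gates


# Toy usage: four random streams standing in for the reference image,
# keypoints, shading mesh, and audio features (all placeholder data).
rng = np.random.default_rng(0)
streams = rng.normal(size=(4, 8, 16, 16))
weight = rng.normal(size=(4 * 8, 32)) * 0.1
bias = np.zeros(4 * 8)
fused, gates = route_conditions(streams, t=500.0, weight=weight, bias=bias)
```

Because the gates are computed per channel and recomputed for each timestep, the relative weight of each condition can differ across feature subspaces and noise levels, in contrast to the fixed-weight fusion the abstract attributes to existing methods. With zero weights and biases the sketch reduces to a plain average of the four streams.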
