MoDA: Multi-modal Diffusion Architecture for Talking Head Generation
Xinyang Li, Gen Li, Zhihui Lin, Yichen Qian, GongXin Yao, Weinan Jia, Aowen Wang, Weihua Chen, Fan Wang

TL;DR
MoDA introduces a multi-modal diffusion architecture that enhances talking head generation by improving realism, diversity, and efficiency through innovative joint parameter space modeling and multi-modal fusion strategies.
Contribution
The paper proposes a novel multi-modal diffusion framework with a joint parameter space and a coarse-to-fine fusion strategy for improved talking head generation.
Findings
Increased video realism and diversity.
Enhanced facial expressiveness and head movements.
Reduced inference time and visual artifacts.
Abstract
Talking head generation with arbitrary identities and speech audio remains a crucial problem in the realm of the virtual metaverse. Recently, diffusion models have become a popular generative technique in this field with their strong generation capabilities. However, several challenges remain for diffusion-based methods: 1) inefficient inference and visual artifacts caused by the implicit latent space of Variational Auto-Encoders (VAE), which complicates the diffusion process; 2) a lack of authentic facial expressions and head movements due to inadequate multi-modal information fusion. In this paper, MoDA handles these challenges by: 1) defining a joint parameter space that bridges motion generation and neural rendering, and leveraging flow matching to simplify diffusion learning; 2) introducing a multi-modal diffusion architecture to model the interaction among noisy motion, audio, and…
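The abstract states that MoDA uses flow matching to simplify diffusion learning, and one review mentions a 10-step rectified-flow variant. The following is a minimal NumPy sketch of the generic rectified-flow recipe, not the authors' implementation. The function names, the time convention (noise at t = 0, data at t = 1), and the toy 4-dimensional "motion" vector are all assumptions made for illustration.

```python
import numpy as np

def flow_matching_pair(x1, rng):
    """Build one training example for a generic rectified-flow objective.

    x1 is a clean sample (here: a hypothetical motion-parameter vector).
    Returns the noisy point x_t, the time t, and the regression target.
    """
    x0 = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    t = rng.uniform()                    # time drawn from [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolation
    target_velocity = x1 - x0            # constant velocity along that line
    return xt, t, target_velocity

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)
    return x

# Toy check with an oracle velocity field that points straight at one target.
rng = np.random.default_rng(0)
target = np.array([0.5, -1.0, 2.0, 0.0])          # hypothetical clean motion
oracle = lambda x, t: (target - x) / (1.0 - t)     # exact field for one target
start = rng.standard_normal(target.shape)
result = euler_sample(oracle, start, num_steps=10)
```

In a real model, `velocity_fn` would be a trained network conditioned on audio and the other inputs, trained by regressing its output onto `target_velocity` with a mean-squared error. Because the interpolation path is a straight line, few Euler steps can suffice, which is the usual motivation for this formulation.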
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper is well organized and easy to follow.
- The proposed multi-stage fusion design is interesting and shows clear performance improvements.
1. The coarse-to-fine stage setup feels somewhat heuristic, and it is not entirely clear why identity, audio, and emotion are treated as “coarse” features.
2. While the proposed fusion strategy empirically improves performance, there is little analysis quantifying why each fusion step (e.g., merging audio with emotion and identity first) is optimal. The explanation remains largely intuitive, without supporting ablations or sensitivity analysis
- It is easy to read and the figures are easy to understand.
- The joint-attention formulation is simple and addresses cross-modal alignment.
- The coarse-to-fine (C2F) design is well-motivated and empirically supported. It reduces parameters dramatically while improving sync and motion metrics.
- The 10-step rectified-flow variant appears effective.
1. Evaluation scope is narrow. Although public benchmarks provide established test splits, the paper evaluates on only 50 clips per benchmark and 20 in-the-wild examples. It is unclear why such a small subset was chosen despite the availability of full/standard test sets, and whether the 50 were truly random (or potentially cherry-picked). Please justify the subset size, describe the sampling protocol, and release the exact file indices to enable reproducibility. Also, the user study lacks detail
- Unified Multi-Modal Diffusion Framework: The paper introduces a well-structured MMDiT-based architecture that explicitly fuses motion, emotion, identity, and audio modalities. This design targets the persistent inter-modal inconsistency found in existing diffusion-based talking head methods.
- Coarse-to-Fine Feature Fusion: The proposed C2F mechanism effectively balances modality-specific processing and shared fusion, improving both parameter efficiency and motion consistency. The ablation s
- Modality Design Justification: The inclusion of emotion, identity, audio, and motion as four separate modalities seems ad hoc. Other plausible combinations (e.g., audio + identity, audio + expression) are not tested. The paper lacks a systematic ablation on modality selection or their individual contribution.
- Classifier-Free Guidance Reliance: Most conditional inputs except motion are used primarily as classifier-free guidance, which weakens the argument for deep joint modeling. This could
(1) It explicitly models the interactions among motion, audio, and auxiliary conditions, effectively mitigating stylistic inconsistencies across modalities and significantly enhancing overall generation quality.
(2) The proposed Motion Transformer Block achieves efficient multi-modal feature fusion through modality-specific paths, adaptive normalization, and modulation mechanisms, leading to more natural and coherent talking head synthesis.
(3) The MoDA architecture balances high-quality generati
(1) The experimental comparisons focus on Hallo2, without evaluating against more recent and advanced models such as Hallo3.
(2) Although MoDA is relatively fast, its training still requires substantial computational resources. Demonstrating its performance on consumer-grade GPUs would further validate its efficiency.
- MoDA demonstrates improved performance over several existing methods (e.g., EchoMimic, JoyVASA) in terms of metrics like FVD, FID, and Sync.
- The framework delivers high efficiency, supporting real-time inference, which makes it suitable for practical applications.
- Many of MoDA’s methods and techniques rely on existing models, such as multi-modal fusion and audio-to-video mapping, without offering significant innovations or breakthroughs. The framework primarily combines established approaches.
- The methodology section is unclear, especially regarding the integration of multiple modalities (emotion, identity, audio) and how these components interact within the architecture. The technical details are underexplained, making it hard to understand the model
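One review above notes that most conditional inputs are applied through classifier-free guidance. As a reference point for that comment, here is a minimal sketch of the standard guidance combination applied to a predicted velocity. The function name, the single shared guidance scale, and the toy vectors are assumptions for illustration. The paper may well use separate scales per condition.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by a factor `scale`."""
    return v_uncond + scale * (v_cond - v_uncond)

# Hypothetical model outputs for one sampling step.
v_with_audio = np.array([1.0, 0.0, -1.0])
v_without_audio = np.array([0.5, 0.5, 0.0])
guided = cfg_velocity(v_with_audio, v_without_audio, scale=2.0)
```

A scale of 1 recovers the conditional prediction, 0 recovers the unconditional one, and values above 1 push the sample further toward the condition.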
Taxonomy
Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
