TL;DR
JAM-Flow is a unified flow-based framework that simultaneously synthesizes facial motion and speech, enabling diverse audio-visual generation tasks with cross-modal conditioning.
Contribution
It introduces a novel Multi-Modal Diffusion Transformer architecture with specialized modules and training objectives for integrated audio-visual synthesis.
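As a rough illustration of how a flow-matching objective can be combined with inpainting-style masking, the sketch below uses the common linear-path (rectified-flow style) formulation and computes the loss only on positions marked for generation. The paper's exact objective is not given on this page, so the path, the loss, and all names (`interpolate`, `target_velocity`, `masked_fm_loss`, `inpaint_mask`) are assumptions made for illustration.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t for a linear path."""
    return [b - a for a, b in zip(x0, x1)]


def masked_fm_loss(pred_v, x0, x1, inpaint_mask):
    """Mean squared velocity error over masked positions only.

    Positions where inpaint_mask is False are treated as given context
    and do not contribute to the loss (hypothetical convention).
    """
    target = target_velocity(x0, x1)
    errs = [(p - v) ** 2 for p, v, m in zip(pred_v, target, inpaint_mask) if m]
    return sum(errs) / max(len(errs), 1)


# Toy 1-D "sequence"; in practice x0 would be Gaussian noise, fixed here
# so the numbers are reproducible.
x1 = [1.0, 2.0, 3.0, 4.0]
x0 = [0.0, 0.0, 0.0, 0.0]
inpaint_mask = [False, True, True, False]  # generate the middle two positions

x_t = interpolate(x0, x1, 0.5)
loss_perfect = masked_fm_loss(target_velocity(x0, x1), x0, x1, inpaint_mask)
loss_zero = masked_fm_loss([0.0] * 4, x0, x1, inpaint_mask)
```

Varying which positions (or which whole modality) the mask covers is one way a single model could be trained to handle several conditioning setups; whether JAM-Flow does exactly this is not confirmed here.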
Findings
Supports text, audio, and motion conditioning for talking head generation.
Achieves synchronized audio-visual synthesis within a single model.
Significantly improves multi-modal generative capabilities.
Abstract
The intrinsic link between facial motion and speech is often overlooked in generative modeling, where talking head synthesis and text-to-speech (TTS) are typically addressed as separate tasks. This paper introduces JAM-Flow, a unified framework to simultaneously synthesize and condition on both facial motion and speech. Our approach leverages flow matching and a novel Multi-Modal Diffusion Transformer (MM-DiT) architecture, integrating specialized Motion-DiT and Audio-DiT modules. These are coupled via selective joint attention layers and incorporate key architectural choices, such as temporally aligned positional embeddings and localized joint attention masking, to enable effective cross-modal interaction while preserving modality-specific strengths. Trained with an inpainting-style objective, JAM-Flow supports a wide array of conditioning inputs, including text, reference audio, and…
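One way to picture localized joint attention masking with temporally aligned positions is to give every motion and audio token a timestamp and let a motion token attend only to audio tokens within a small time window. The following is a minimal sketch under that assumption; the token rates (25 Hz motion, 100 Hz audio), the 40 ms window, and the function names are hypothetical and not taken from the paper.

```python
def token_times(num_tokens, rate_hz):
    """Center timestamp (seconds) of each token at a given token rate."""
    return [(i + 0.5) / rate_hz for i in range(num_tokens)]


def localized_joint_mask(num_motion, motion_hz, num_audio, audio_hz, window_s):
    """mask[i][j] is True when motion token i may attend to audio token j.

    Tokens from both streams are placed on a shared time axis, so the
    mask stays local in time even though the two rates differ.
    """
    motion_t = token_times(num_motion, motion_hz)
    audio_t = token_times(num_audio, audio_hz)
    return [[abs(tm - ta) <= window_s for ta in audio_t] for tm in motion_t]


# Hypothetical rates: 4 motion tokens at 25 Hz and 16 audio tokens at 100 Hz
# both span 0.16 s; each motion token sees audio within +/- 40 ms.
mask = localized_joint_mask(
    num_motion=4, motion_hz=25.0, num_audio=16, audio_hz=100.0, window_s=0.04
)
visible_per_motion_token = [sum(row) for row in mask]
```

The same shared timestamps could also drive the positional embeddings of both streams, so that tokens describing the same instant receive matching positions regardless of modality.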
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The idea of unifying speech and motion synthesis in a single model is interesting, and the architecture design, including joint attention and masking, is reasonable. 2. The proposed inpainting-based training enables flexible conditioning under missing modalities.
1. In the experiments, the improvements over baselines are marginal or inconsistent. In TTS, WER even increases compared with baselines. 2. The scalability and generalization of the proposed framework are not convincingly discussed. How does the method perform with longer sequences, different speakers, or higher-resolution videos? 3. The evaluation is limited. Metrics such as LSE-C/LSE-D are known to be unstable, and no strong multimodal baselines are provided for fair comparison.
1. Consolidating multiple functionalities into a unified model enhances flexibility and improves usability for users. 2. The authors have made commendable efforts in both model functionality and experimental design, which may offer certain insights for future research. 3. The analysis of mouth keypoints in LivePortrait contributes to improved lip-speech consistency.
1. The authors claim to propose the first unified framework for TTS and THG, yet the motivation and significance of such integration are not clearly articulated. In typical THG pipelines, the driving signals—whether speech, video, or text—are already well-supported by a range of mature generation techniques, which raises questions about the practical necessity of unifying these components. 2. The combination of TTS and THG in this work appears more akin to an engineering optimization effort tha
* Joint modeling direction: The idea of training a single flow-matching framework to co-model both speech and facial motion is conceptually appealing and targets an underexplored space between talking-head generation and TTS systems. * Architectural clarity: The paper carefully describes how Motion-DiT and Audio-DiT interact through partial joint attention, cross-modal RoPE alignment, and modality-specific attention masking, all of which are reasonable and technically sound design choices. * St
1. Partial joint attention design not well justified: Only half of the layers are fused via joint attention, but the paper does not explain how this number was selected, whether more or fewer fusion layers were tested, or how fusion depth affects stability and performance. 2. Unclear practical benefits of joint modeling: While the paper argues that jointly modeling speech and motion reflects the natural coupling of human communication, it is unclear what measurable benefit this joint training pr
1. The paper is clear and easy to follow, with well-structured writing and visuals. 2. It cites relevant work and explains design choices logically, making the reasoning convincing. 3. The flexible input setting is an interesting and practical aspect of the framework.
1. The paper does not provide clear quantitative evidence on how joint training improves performance over unimodal setups. Based on the current presentation, the main benefit of joint supervision seems to lie in enabling flexible input configurations rather than delivering measurable quality gains. 2. The effectiveness of the proposed attention masking is discussed qualitatively, but additional quantitative analysis would help clarify its impact on temporal alignment and overall performance. 3.
1. The proposed method is novel. It is the first joint training framework for talking head and TTS generation. 2. The paper is well written and easy to follow.
1. The performance in the demo video is not satisfactory. The teaser video exhibits significant artifacts. 2. The paper lacks comparisons with recent methods, such as EDTalk. 3. Since the paper only generates mouth motions, it should compare the method with visual dubbing methods (which also only generate mouth motions), such as wav2vec, stylesync.
Taxonomy
Topics: Face recognition and analysis · Generative Adversarial Networks and Image Synthesis · Social Robot Interaction and HRI
