MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Ye Tian; Ling Yang; Jiongfan Yang; Anran Wang; Yu Tian; Jiani Zheng; Haochen Wang; Zhiyang Teng; Zhuochen Wang; Yinjie Wang; Yunhai Tong; Mengdi Wang; Xiangtai Li

arXiv:2511.09611·cs.CV·November 19, 2025

TL;DR

This paper introduces MMaDA-Parallel, a multimodal diffusion model that enhances thinking-aware image generation by improving cross-modal alignment and semantic consistency through a parallel, bidirectional framework and reinforcement learning.

Contribution

It proposes a novel parallel multimodal diffusion framework with a new benchmark and reinforcement learning strategy to address error propagation in thinking-aware generation.

Findings

01. Achieves a 6.9% improvement in Output Alignment on ParaBench.

02. Significantly enhances cross-modal alignment and semantic consistency.

03. Validates effectiveness through comprehensive experiments.

Abstract

While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies…
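
To make the idea of parallel denoising concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a masked discrete-diffusion setting in which text and image tokens start fully masked and are revealed over a shared sequence of steps, with each modality following its own unmasking schedule. The names `propose`, `linear_schedule`, `cosine_schedule`, and `parallel_denoise`, the vocabulary sizes, and the random stand-in predictor are all hypothetical and chosen only for illustration.

```python
import math
import random

MASK = None          # marker for a token that is still masked
TEXT_VOCAB = 100     # hypothetical vocabulary sizes
IMAGE_VOCAB = 512


def linear_schedule(step, total_steps, length):
    """Tokens that should be revealed after `step` of `total_steps` (linear)."""
    return math.ceil(length * step / total_steps)


def cosine_schedule(step, total_steps, length):
    """Tokens that should be revealed after `step` steps (slow start, fast end)."""
    if step == total_steps:
        return length
    frac = 1.0 - math.cos(math.pi / 2 * step / total_steps)
    return math.floor(length * frac)


def propose(text, image, rng):
    """Stand-in for a denoiser (assumption): it sees BOTH sequences and returns
    a (token, confidence) proposal for every masked position in each one."""
    context = sum(tok is not MASK for tok in text + image)

    def fill(seq, vocab):
        return {
            i: (rng.randrange(vocab), rng.random() + 0.01 * context)
            for i, tok in enumerate(seq)
            if tok is MASK
        }

    return fill(text, TEXT_VOCAB), fill(image, IMAGE_VOCAB)


def parallel_denoise(text_len, image_len, steps, seed=0):
    """Reveal text and image tokens side by side over the same steps."""
    rng = random.Random(seed)
    text = [MASK] * text_len
    image = [MASK] * image_len
    history = []  # (revealed text tokens, revealed image tokens) per step

    for step in range(1, steps + 1):
        # One joint prediction per step, conditioned on both modalities.
        text_props, image_props = propose(text, image, rng)
        plans = (
            (text, text_props, linear_schedule),
            (image, image_props, cosine_schedule),
        )
        for seq, props, schedule in plans:
            target = schedule(step, steps, len(seq))
            revealed = sum(tok is not MASK for tok in seq)
            need = max(0, target - revealed)
            # Commit the most confident proposals for this modality.
            ranked = sorted(props, key=lambda i: props[i][1], reverse=True)
            for i in ranked[:need]:
                seq[i] = props[i][0]
        history.append((
            sum(tok is not MASK for tok in text),
            sum(tok is not MASK for tok in image),
        ))
    return text, image, history
```

Calling `parallel_denoise(8, 16, 4)` returns the two fully revealed token lists plus a per-step count of revealed tokens in each modality. The point of the sketch is only that both sequences are updated within the same step and that every prediction can condition on the partial state of the other modality, in contrast to finishing the text before starting the image.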

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The paper is well structured and clearly motivated. It offers a thorough investigation that includes a new benchmark, curated training datasets, a new model design, and an RL protocol aimed at addressing the problem. The proposed MMaDA-Parallel model performs on par with a SOTA open-source model that was trained on more data.

Weaknesses

Normally, the decoding process has only one scheduler. In this paper, two schedulers are used, one for each modality. Could the authors give a more systematic justification for why independence between the modalities can be assumed, and why aligning the text and image modalities would make independent, parallel generation work better than any-order joint generation?

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The proposed ParaBench is an effective tool for the analysis of thinking-aware image synthesis. The finding regarding the strong correlation between performance degradation and poor alignment between the generated modalities is interesting and insightful.
2. This paper explains the method design clearly, with insightful motivations.
3. Visualization results in Figure 5 are impressive, showing the improvement in challenging scenarios such as the compositional settings.
4. The paper is well-

Weaknesses

1. This paper demonstrates its improvements on the proposed ParaBench (Table 2). How does it perform on the standard public benchmarks used in the original Bagel paper, such as GenEval [1], WISE [2], and GEdit-Bench [3]?

[1] Geneval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.
[2] Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025.
[3] Step1x-edit: A practical framework for general image editing

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The paper provides a clear and insightful analysis of failure modes in thinking-aware multimodal generation, highlighting a real problem in current reasoning–generation pipelines.
2. ParaBench offers a valuable evaluation framework that jointly measures reasoning and image alignment, which could be useful for future multimodal research.
3. The proposed parallel diffusion framework with stepwise semantic optimization is well-motivated and achieves measurable gains in cross-modal alignment.

Weaknesses

1. The novelty of the method is limited: the proposed MMaDA-Parallel and ParaRL mainly combine existing ideas of diffusion fine-tuning and reinforcement learning under a parallel setting.
2. The reported 6.9% improvement in output alignment, while positive, appears modest given the additional complexity and training cost. The authors are encouraged to provide stronger evidence that this gain is statistically or practically significant, or to further improve the results through more comprehensive exper

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling