MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li

TL;DR
This paper introduces MMaDA-Parallel, a multimodal diffusion model that enhances thinking-aware image generation by improving cross-modal alignment and semantic consistency through a parallel, bidirectional framework and reinforcement learning.
Contribution
It proposes a parallel multimodal diffusion framework (MMaDA-Parallel), together with a new benchmark (ParaBench) and a reinforcement learning strategy (ParaRL), to address error propagation in thinking-aware generation.
Findings
Achieves 6.9% improvement in Output Alignment on ParaBench.
Significantly enhances cross-modal alignment and semantic consistency.
Validates effectiveness through comprehensive experiments.
Abstract
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies…
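The parallel denoising idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the denoiser is a random stand-in, and the `MASK` sentinel, the sequence lengths, and the choice of a linear schedule for text and a cosine schedule for image tokens are all assumptions made for illustration. It only shows the control flow in which both modalities start fully masked, a single joint prediction is made at every step, and each modality reveals tokens according to its own schedule.

```python
import math
import random

MASK = -1  # hypothetical sentinel for a still-masked token


def linear_schedule(step, total_steps, length):
    """Target number of revealed tokens after `step` steps (assumed linear)."""
    if step >= total_steps:
        return length
    return round(length * step / total_steps)


def cosine_schedule(step, total_steps, length):
    """Target number of revealed tokens after `step` steps (assumed cosine, slow start)."""
    if step >= total_steps:
        return length
    return round(length * (1 - math.cos(math.pi / 2 * step / total_steps)))


def toy_denoiser(text, image, rng):
    """Stand-in for a joint model: one (token, confidence) proposal per position.

    A real model would condition on both partially denoised sequences;
    here the proposals are random so the sketch stays self-contained.
    """
    n = len(text) + len(image)
    return [(rng.randrange(100), rng.random()) for _ in range(n)]


def unmask_top_k(tokens, proposals, target_revealed):
    """Reveal the most confident masked positions until the target count is met."""
    masked = [i for i, t in enumerate(tokens) if t == MASK]
    already_revealed = len(tokens) - len(masked)
    k = max(0, target_revealed - already_revealed)
    by_confidence = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
    for i in by_confidence[:k]:
        tokens[i] = proposals[i][0]


def parallel_denoise(text_len=8, image_len=16, total_steps=4, seed=0):
    """Denoise text and image tokens in the same loop, each with its own schedule."""
    rng = random.Random(seed)
    text = [MASK] * text_len
    image = [MASK] * image_len
    trajectory = []  # (masked text tokens, masked image tokens) after each step
    for step in range(1, total_steps + 1):
        # One joint prediction per step sees the current state of both modalities.
        proposals = toy_denoiser(text, image, rng)
        unmask_top_k(text, proposals[:text_len],
                     linear_schedule(step, total_steps, text_len))
        unmask_top_k(image, proposals[text_len:],
                     cosine_schedule(step, total_steps, image_len))
        trajectory.append((text.count(MASK), image.count(MASK)))
    return text, image, trajectory


if __name__ == "__main__":
    text, image, trajectory = parallel_denoise()
    print(trajectory)
```

In contrast to a sequential pipeline that finishes the reasoning text before producing any image token, neither modality is complete until the final step here, so intermediate text and image states are available to each other throughout the trajectory.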
Peer Reviews
Decision · ICLR 2026 Poster
The paper is well structured and clearly motivated. It offers a thorough investigation that includes new benchmarking, curated training datasets, a new model design, and an RL protocol aimed at addressing the problem. The proposed MMaDA-Parallel model performs on par with a SOTA open-source model that was trained on more data.
Normally, the decoding process has only one scheduler. In this paper, a separate scheduler is used for each modality. Could the authors give a more systematic justification for assuming independence between the modalities, and explain why alignment between the text and image modalities would make independent, parallel generation work better than any-order joint generation?
1. The proposed ParaBench is an effective tool for the analysis of thinking-aware image synthesis. The finding regarding the strong correlation between performance degradation and poor alignment between the generated modalities is interesting and insightful.
2. This paper explains the method design clearly, with insightful motivations.
3. Visualization results in Figure 5 are impressive, showing the improvement in challenging scenarios such as the compositional settings.
4. The paper is well-
1. The paper demonstrates its improvements on the proposed ParaBench (Table 2). How does it perform on the standard public benchmarks used in the original Bagel paper, such as GenEval [1], WISE [2], and GEdit-Bench [3]?
[1] GenEval: An object-focused framework for evaluating text-to-image alignment. NeurIPS, 2023.
[2] WISE: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265, 2025.
[3] Step1X-Edit: A practical framework for general image editing
1. The paper provides a clear and insightful analysis of failure modes in thinking-aware multimodal generation, highlighting a real problem in current reasoning–generation pipelines.
2. ParaBench offers a valuable evaluation framework that jointly measures reasoning and image alignment, which could be useful for future multimodal research.
3. The proposed parallel diffusion framework with stepwise semantic optimization is well-motivated and achieves measurable gains in cross-modal alignment.
1. The method's novelty is limited: the proposed MMaDA-Parallel and ParaRL mainly combine existing ideas of diffusion fine-tuning and reinforcement learning under a parallel setting.
2. The reported 6.9% improvement in output alignment, while positive, appears modest given the additional complexity and training cost. The authors are encouraged to provide stronger evidence that this gain is statistically or practically significant, or to further improve the results through more comprehensive exper
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
