Training-Free Self-Correction for Multimodal Masked Diffusion Models
Yidong Ouyang, Panwen Hu, Zhengyan Wan, Zhe Wang, Liyan Xie, Dmitriy Bespalov, Ying Nian Wu, Guang Cheng, Hongyuan Zha, Qiang Sun

TL;DR
This paper introduces a self-correction method for masked diffusion models that improves generation quality and reduces the number of sampling steps. It requires no additional training and no auxiliary evaluators, and it applies across different masked diffusion architectures.
Contribution
The paper proposes a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models to improve text-to-image generation and multimodal understanding.
Findings
Significantly improves generation quality on text-to-image generation and multimodal understanding tasks.
Reduces sampling steps needed for high-quality outputs.
Generalizes across different masked diffusion architectures.
Abstract
Masked diffusion models have emerged as a powerful framework for text and multimodal generation. However, their sampling procedure updates multiple tokens simultaneously and treats generated tokens as immutable, which may lead to error accumulation when early mistakes cannot be revised. In this work, we revisit existing self-correction methods and identify limitations stemming from additional training requirements or reliance on misaligned likelihood estimates. We propose a training-free self-correction framework that exploits the inductive biases of pre-trained masked diffusion models. Without modifying model parameters or introducing auxiliary evaluators, our method significantly improves generation quality on text-to-image generation and multimodal understanding tasks with reduced sampling steps. Moreover, the proposed framework generalizes across different masked diffusion…
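To make the failure mode described in the abstract concrete, the following is a minimal sketch of a generic confidence-based masked sampler that commits several tokens per step and never revisits them. It is not the paper's method. `toy_predict`, `sample_immutable`, and the `MASK` sentinel are hypothetical names introduced for illustration, and the toy predictor is a random stand-in for a pre-trained model.

```python
import random

MASK = -1  # hypothetical sentinel marking a still-masked position


def toy_predict(tokens, rng):
    """Hypothetical stand-in for a pre-trained masked diffusion model.

    Returns one (token, confidence) guess per position. A real model would
    condition on the unmasked context; this toy version ignores it.
    """
    return [(rng.randrange(10), rng.random()) for _ in tokens]


def sample_immutable(length, steps, seed=0):
    """Unmask several positions per step; committed tokens are never revised."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    history = [list(tokens)]
    per_step = -(-length // steps)  # ceiling division: tokens committed per step
    for _ in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predict(tokens, rng)
        # Commit the most confident predictions among still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


final, history = sample_immutable(length=8, steps=4)
for step, state in enumerate(history):
    print(step, state)
```

Each printed row differs from the previous one only at positions that were still masked, so a poor early commitment stays in the sequence until sampling ends. A self-correction scheme would need some way to reopen such positions.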
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Model Reduction and Neural Networks
