From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model
Yatai Ji, Teng Wang, Yuying Ge, Zhiheng Liu, Sidi Yang, Ying Shan, Ping Luo

TL;DR
This paper introduces ReDiff, a novel framework that transforms vision-language diffusion models from passive denoisers into active refiner models, significantly enhancing generation coherence and accuracy by enabling self-correction during the diffusion process.
Contribution
ReDiff is a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors, addressing the train-inference discrepancy in discrete vision-language diffusion models.
Findings
ReDiff improves coherence and factual accuracy of generated content.
ReDiff enables stable, efficient parallel generation that surpasses traditional denoising-based decoding.
The approach breaks the error cascade that arises when early token errors pollute the context during parallel decoding.
Abstract
Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we…
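The first training stage described above (teaching the model to revise synthetic errors) can be illustrated with a small data-construction sketch. This is not the authors' implementation: the function name `make_revision_example`, the `[MASK]` symbol, the uniform random replacement scheme, and the `mask_ratio` and `error_ratio` values are all assumptions chosen for illustration. The idea is only that, besides masking tokens as in standard discrete diffusion training, some visible tokens are deliberately replaced with wrong ones, so the model is supervised to fix them as well as to fill in masks.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def make_revision_example(tokens, vocab, mask_ratio=0.4, error_ratio=0.2, seed=0):
    """Build one (corrupted_input, target, loss_positions) triple.

    Each position is either masked, replaced by a wrong token
    (a synthetic error), or left clean. The ratios are illustrative.
    """
    rng = random.Random(seed)
    n = len(tokens)
    positions = list(range(n))
    rng.shuffle(positions)

    n_mask = int(n * mask_ratio)
    n_err = int(n * error_ratio)
    masked = set(positions[:n_mask])
    wrong = set(positions[n_mask:n_mask + n_err])

    corrupted = []
    for i, tok in enumerate(tokens):
        if i in masked:
            corrupted.append(MASK)
        elif i in wrong:
            # Any vocabulary token other than the correct one counts as an error.
            candidates = [v for v in vocab if v != tok]
            corrupted.append(rng.choice(candidates))
        else:
            corrupted.append(tok)

    # Supervise both kinds of corrupted positions: masks must be filled,
    # wrong tokens must be revised back to the original.
    loss_positions = sorted(masked | wrong)
    return corrupted, list(tokens), loss_positions


if __name__ == "__main__":
    caption = ["a", "dog", "runs", "on", "the", "green", "grass", "near", "a", "lake"]
    vocab = sorted(set(caption)) + ["cat", "blue", "sits"]
    corrupted, target, loss_positions = make_revision_example(caption, vocab, seed=1)
    print(corrupted)
    print(loss_positions)
```

A model trained only on masked inputs never sees a wrong visible token, which is the train-inference discrepancy the abstract refers to; including the replaced positions in the loss is one simple way to expose it to such errors.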
Peer Reviews
Decision · Submitted to ICLR 2026
1. The paper is well-structured and well-written, with effective visuals to illustrate both conceptual and qualitative outcomes.
2. This paper proposes a novel paradigm that moves from denoising to refining, which reconceptualizes how discrete diffusion models perform generation. The model explicitly learns from its own flawed drafts, rather than synthetic noise alone, which is an elegant and practical innovation.
3. The empirical performance is good over the baselines. The ablation studies also
1. The self-correction loop relies on an external “expert model” (o4-mini) for generating corrected drafts. While practical, this introduces external bias and resource dependence. The paper could discuss how results vary with different or weaker expert models.
2. The evaluation focuses only on detailed image captioning. Although this is a strong proxy task, extending to other vision-language generation tasks (e.g., dialog, instruction following) would test generalization.
3. The two-stage traini
The problem is clearly defined, and the writing is good. Secondly, it seems intuitive why the method can address the problem to some extent (e.g., incoherence). For factual errors, the problem might not be fully addressed. It is largely an inherent weakness of data-driven models.
The whole method seems to be a combination of knowledge distillation and self-supervised learning, making it less novel to me; The structure of the paper can be improved. I believe it is more appropriate to place the preliminary section outside the method section. However, by doing this, the method will look much less complicated for a top conference. The authors might need to consider how to dig deeper into the problem; Some of the claims about experiments also look quite strong to me, for exam
- The authors reframe the generation process of discrete diffusion models from passive denoising to active refinement, explicitly targeting the discrepancy between training and inference that leads to error cascades in parallel decoding. The idea of “refining already unmasked tokens while simultaneously unmasking new ones” is an intuitive and clear conceptual change.
- This article clearly explains why discrete diffusion has problems with parallel decoding (context distortion due to
- The online self-correction learning highly depends on an external expert (o4-mini). The authors generate approximately 10k draft-refined caption pairs per round, with “a single round” being considered the most effective. Thus, the trained ReDiff cannot be free of the prior knowledge of the external expert, underscoring its marginal performance. In addition, the data/computing costs, input template, and quality control for expert feedback are not quantified.
- All evaluations refer to detailed
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
