From Denoising to Refining: A Corrective Framework for Vision-Language Diffusion Model

Yatai Ji; Teng Wang; Yuying Ge; Zhiheng Liu; Sidi Yang; Ying Shan; Ping Luo

arXiv:2510.19871 · cs.CL · October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces ReDiff, a framework that turns vision-language diffusion models from passive denoisers into active refiners. By letting the model correct its own errors during the diffusion process, it improves the coherence and accuracy of generated text.

Contribution

ReDiff is a diffusion framework that adds self-correction and refinement to address the train-inference discrepancy in discrete vision-language diffusion models.

Findings

01. ReDiff improves coherence and factual accuracy of generated content.

02. ReDiff enables stable, efficient parallel generation surpassing traditional methods.

03. The approach effectively breaks the error cascade in diffusion models.

Abstract

Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a train-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding errors and leading to syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we…
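The first training stage described above (revising synthetic errors) can be illustrated with a minimal sketch. The abstract does not specify how the synthetic errors are produced, so everything here is an assumption made for illustration: the helper name `make_revision_example`, the `[MASK]` symbol, the masking and error ratios, and the choice to draw wrong tokens uniformly from a toy vocabulary. The idea it shows is only that the training input mixes masked positions with visibly wrong tokens, and the model is supervised on both.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, not taken from the paper


def make_revision_example(tokens, vocab, mask_ratio=0.5, error_ratio=0.1, seed=None):
    """Build one (corrupted input, targets) pair from a clean token list.

    Hypothetical helper: some positions are masked, and some of the
    remaining positions are replaced by a wrong token, so a model trained
    on the pair must both fill in masks and revise visible errors.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    # targets[i] holds the clean token where the model is supervised, else None
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        r = rng.random()
        if r < mask_ratio:
            # standard masked position: the model must denoise it
            corrupted[i] = MASK
            targets[i] = tok
        elif r < mask_ratio + error_ratio:
            # synthetic error: a visible but wrong token the model must revise
            wrong = [v for v in vocab if v != tok]
            if wrong:
                corrupted[i] = rng.choice(wrong)
                targets[i] = tok
    return corrupted, targets


clean = "a dog runs on the beach".split()
vocab = sorted(set(clean) | {"cat", "sits"})
corrupted, targets = make_revision_example(clean, vocab, seed=0)
```

In this sketch, positions with a `None` target are left untouched and carry no loss, while every supervised position differs from the clean sequence, either as a mask or as a wrong token. A real pipeline would operate on token ids and likely use a more realistic error distribution than uniform replacement.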

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is well-structured and well-written, with effective visuals to illustrate both conceptual and qualitative outcomes.
2. This paper proposes a novel paradigm that moves from denoising to refining, which reconceptualizes how discrete diffusion models perform generation. The model explicitly learns from its own flawed drafts, rather than synthetic noise alone, which is an elegant and practical innovation.
3. The empirical performance is good over the baselines. The ablation studies also

Weaknesses

1. The self-correction loop relies on an external “expert model” (o4-mini) for generating corrected drafts. While practical, this introduces external bias and resource dependence. The paper could discuss how results vary with different or weaker expert models.
2. The evaluation focuses only on detailed image captioning. Although this is a strong proxy task, extending to other vision-language generation tasks (e.g., dialog, instruction following) would test generalization.
3. The two-stage traini

Reviewer 02 · Rating 2 · Confidence 3

Strengths

The problem is clearly defined, and the writing is good. Secondly, it seems intuitive why the method can address the problem to some extent (e.g., incoherence). For factual errors, the problem might not be fully addressed. It is largely an inherent weakness of data-driven models.

Weaknesses

- The whole method seems to be a combination of knowledge distillation and self-supervised learning, making it less novel to me.
- The structure of the paper can be improved. I believe it is more appropriate to place the preliminary section outside the method section. However, by doing this, the method will look much less complicated for a top conference. The authors might need to consider how to dig deeper into the problem.
- Some of the claims about experiments also look quite strong to me, for exam

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The authors redesign the generation process using a discrete diffusion model from passive noise reduction to active refinement, explicitly targeting the discrepancy between training and inference that leads to error cascades in parallel decoding. The idea of “refining already unmasked tokens while simultaneously unmasking new ones” is an intuitive and clear conceptual change.
- This article clearly explains why discrete diffusion has problems with parallel decoding (context distortion due to

Weaknesses

- The online self-correction learning highly depends on an external expert (o4-mini). The authors generate approximately 10k draft-refined caption pairs per round, with “a single round” being considered the most effective. Thus, the trained ReDiff cannot be free of the prior knowledge of the external expert, underscoring its marginal performance. In addition, the data/computing costs, input template, and quality control for expert feedback are not quantified.
- All evaluations refer to detailed

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning