Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion
Zheqi Lv, Junhao Chen, Qi Tian, Keting Yin, Shengyu Zhang, Fei Wu

TL;DR
This paper introduces PPAD, a framework that uses a Multimodal Large Language Model (MLLM) to analyze and correct semantic errors during the diffusion process in text-to-image generation, improving prompt-image alignment and image quality.
Contribution
PPAD is presented as the first framework to incorporate real-time semantic supervision from a Multimodal Large Language Model (MLLM) during diffusion, enabling active correction in only a few diffusion steps and improving image fidelity.
Findings
Significant improvement in prompt-image alignment.
Effective semantic correction with few diffusion steps.
Versatile framework supporting inference-only and training modes.
Abstract
Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, making them ineffective in providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large…
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Diffusion
