Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion

Zheqi Lv; Junhao Chen; Qi Tian; Keting Yin; Shengyu Zhang; Fei Wu

arXiv:2505.20053·cs.CV·May 27, 2025

TL;DR

This paper introduces PPAD, a framework that uses a Multimodal Large Language Model (MLLM) to detect and correct semantic errors during the denoising process of text-to-image diffusion, with the aim of improving prompt-image alignment and image quality.

Contribution

The authors describe PPAD as the first approach to add real-time semantic supervision from an MLLM during diffusion, enabling active correction within few diffusion steps and improving image fidelity.

Findings

01. Significant improvement in prompt-image alignment.

02. Effective semantic correction with few diffusion steps.

03. Versatile framework supporting inference-only and training modes.

Abstract

Diffusion models have become the mainstream architecture for text-to-image generation, achieving remarkable progress in visual quality and prompt controllability. However, current inference pipelines generally lack interpretable semantic supervision and correction mechanisms throughout the denoising process. Most existing approaches rely solely on post-hoc scoring of the final image, prompt filtering, or heuristic resampling strategies, which makes them ineffective at providing actionable guidance for correcting the generative trajectory. As a result, models often suffer from object confusion, spatial errors, inaccurate counts, and missing semantic elements, severely compromising prompt-image alignment and image quality. To tackle these challenges, we propose MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD), a novel framework that, for the first time, introduces a Multimodal Large…
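
The abstract is cut off before it describes how PPAD works, so the following is only a toy sketch of the general idea it contrasts with post-hoc scoring: checking an intermediate result partway through sampling and adjusting the conditioning before the trajectory finishes. It is not the authors' algorithm. Every name here (`denoise_step`, `guided_sampling`, `make_toy_critic`) is hypothetical, the "latent" is a plain list of floats, and the critic is a simple numeric stand-in for an MLLM.

```python
from typing import Callable, List, Optional, Tuple

Latent = List[float]
# A critic inspects an intermediate preview and the current conditioning and
# returns corrected conditioning, or None if no correction is needed.
Critic = Callable[[Latent, Latent], Optional[Latent]]


def denoise_step(latent: Latent, conditioning: Latent, strength: float = 0.5) -> Latent:
    """Toy stand-in for one denoising step: move the latent toward the conditioning."""
    return [x + strength * (c - x) for x, c in zip(latent, conditioning)]


def guided_sampling(
    latent: Latent,
    conditioning: Latent,
    steps: int,
    check_every: int,
    critic: Critic,
) -> Tuple[Latent, int]:
    """Run `steps` toy denoising steps, consulting the critic every `check_every` steps.

    Returns the final latent and the number of corrections that were applied.
    """
    corrections = 0
    for step in range(1, steps + 1):
        latent = denoise_step(latent, conditioning)
        # Mid-trajectory check: skipped on the last step, where nothing is left to steer.
        if step % check_every == 0 and step < steps:
            revised = critic(latent, conditioning)
            if revised is not None:
                conditioning = revised
                corrections += 1
    return latent, corrections


def make_toy_critic(intended: Latent, tolerance: float = 0.25) -> Critic:
    """Hypothetical stand-in for an MLLM: flags a preview that is far from the intended content."""

    def critic(preview: Latent, conditioning: Latent) -> Optional[Latent]:
        error = max(abs(p - t) for p, t in zip(preview, intended))
        if error > tolerance and conditioning != intended:
            return list(intended)
        return None

    return critic


if __name__ == "__main__":
    intended = [1.0, 2.0]          # what the prompt "really" asks for
    flawed = [1.0, 0.0]            # conditioning that misses the second element
    no_check: Critic = lambda preview, conditioning: None

    baseline, _ = guided_sampling([0.0, 0.0], flawed, 6, 2, no_check)
    corrected, n = guided_sampling([0.0, 0.0], flawed, 6, 2, make_toy_critic(intended))
    print("baseline :", baseline)
    print("corrected:", corrected, "after", n, "correction(s)")
```

In this toy setting the uncorrected run never recovers the missing second component, while the run with a mid-trajectory check does. A real system would replace the numeric critic with a model that inspects a decoded preview image, and how that feedback is fed back into the sampler is exactly the part the truncated abstract does not specify.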

Code & Models

Repositories

hellozicky/ppad (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Diffusion