RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment

Zutao Jiang; Guian Fang; Jianhua Han; Guansong Lu; Hang Xu; Shengcai Liao; Xiaojun Chang; Xiaodan Liang

arXiv:2305.19599·cs.CV·October 25, 2024·1 cites

Open Access

TL;DR

RealignDiff introduces a two-stage coarse-to-fine semantic re-alignment approach for text-to-image diffusion models, significantly enhancing the alignment between generated images and textual prompts, leading to improved visual quality and semantic accuracy.

Contribution

The paper proposes a novel two-stage re-alignment method using BLIP-2 and local dense captioning to better align images with text prompts in diffusion models.

Findings

01. Outperforms baseline re-alignment methods in visual quality

02. Achieves higher semantic similarity on MS-COCO and ViLG-300 datasets

03. Demonstrates effectiveness of coarse-to-fine re-alignment approach

Abstract

Recent advances in text-to-image diffusion models have achieved remarkable success in generating high-quality, realistic images from textual descriptions. However, these approaches have faced challenges in precisely aligning the generated visual content with the textual concepts described in the prompts. In this paper, we propose a two-stage coarse-to-fine semantic re-alignment method, named RealignDiff, aimed at improving the alignment between text and images in text-to-image diffusion models. In the coarse semantic re-alignment phase, a novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the semantic discrepancy between the generated image caption and the given text prompt. Subsequently, the fine semantic re-alignment stage employs a local dense caption generation module and a re-weighting attention modulation module to refine the previously generated images…
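The coarse stage described in the abstract scores a generated image by captioning it and comparing that caption with the prompt. The sketch below illustrates only that idea and is not the authors' implementation: `caption_reward`, `bag_of_words`, and `toy_captioner` are hypothetical names, the `captioner` argument stands in for a BLIP-2 captioning call, and the bag-of-words cosine similarity is an assumed placeholder for whatever semantic similarity measure the paper actually uses.

```python
import math
import re
from collections import Counter
from typing import Callable


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty caption or prompt carries no signal
    return dot / (norm_a * norm_b)


def caption_reward(image: object, prompt: str,
                   captioner: Callable[[object], str]) -> float:
    """Reward in [0, 1]: higher when the image's caption matches the prompt.

    `captioner` is an assumption: any callable mapping an image to a caption
    string (a BLIP-2 wrapper in a real pipeline).
    """
    caption = captioner(image)
    return cosine_similarity(bag_of_words(caption), bag_of_words(prompt))


def toy_captioner(image: dict) -> str:
    """Hypothetical stand-in: the 'image' is a dict carrying its own caption."""
    return image["caption"]


if __name__ == "__main__":
    prompt = "a red car parked on the street"
    good = {"caption": "a red car parked on the street"}
    bad = {"caption": "two dogs playing in a park"}
    print(caption_reward(good, prompt, toy_captioner))
    print(caption_reward(bad, prompt, toy_captioner))
```

In a real pipeline the reward would presumably be fed back into fine-tuning the diffusion model; the truncated abstract does not specify how, so that step is left out here.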

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Image Retrieval and Classification Techniques

Methods: Diffusion