Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback
Jaskirat Singh, Liang Zheng

TL;DR
This paper introduces a decompositional evaluation method that uses VQA feedback to measure and improve text-to-image alignment. The resulting score tracks human ratings more closely than CLIP, and the refinement step improves image fidelity to complex prompts.
Contribution
It proposes a novel assertion-based alignment score and an iterative refinement process to improve text-to-image generation quality.
Findings
The proposed alignment score correlates better with human ratings than CLIP does.
Iterative refinement improves image alignment with complex prompts.
In human user studies, achieves 8.7% higher overall text-to-image alignment accuracy than the previous state of the art.
Abstract
The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of the given text input increases, state-of-the-art diffusion models may still fail to generate images that accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which, given a complex prompt, decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for…
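The decompose-then-score idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The names `da_score`, `weakest_assertion`, and `vqa_yes_prob` are hypothetical; the VQA model is abstracted as any callable that returns the probability that an assertion holds for an image; and because the abstract is cut off before it says how per-assertion scores are combined, the plain mean used here is only a placeholder assumption that can be swapped out through the `aggregate` parameter.

```python
from statistics import mean
from typing import Callable, Dict, Iterable, Sequence, Tuple

# Hypothetical VQA interface (an assumption, not an API from the paper):
# takes an image and an assertion, returns P(assertion is true) in [0, 1].
VqaScorer = Callable[[object, str], float]


def da_score(
    image: object,
    assertions: Sequence[str],
    vqa_yes_prob: VqaScorer,
    aggregate: Callable[[Iterable[float]], float] = mean,
) -> Tuple[float, Dict[str, float]]:
    """Score an image against a prompt that was already split into assertions.

    Returns the aggregated score and the per-assertion scores.
    The default aggregation (mean) is a placeholder assumption.
    """
    if not assertions:
        raise ValueError("at least one assertion is required")
    per_assertion = {a: vqa_yes_prob(image, a) for a in assertions}
    return aggregate(list(per_assertion.values())), per_assertion


def weakest_assertion(per_assertion: Dict[str, float]) -> str:
    """Return the assertion with the lowest score.

    A refinement loop could use this as feedback on what to fix next.
    """
    return min(per_assertion, key=per_assertion.get)
```

In use, a prompt such as "a red cube on a blue table" might be split into assertions like "there is a red cube" and "the table is blue". Each assertion is scored separately, and the lowest-scoring one indicates which part of the prompt the image fails to express.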
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods · Domain Adaptation and Few-Shot Learning
Methods: Diffusion · BLIP: Bootstrapping Language-Image Pre-training · Contrastive Language-Image Pre-training
