Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback

Jaskirat Singh; Liang Zheng

arXiv:2307.04749 · cs.CV · December 7, 2023 · 1 citation

TL;DR

This paper introduces a decompositional method that uses VQA feedback to both evaluate and improve text-to-image alignment. The resulting score agrees with human ratings better than existing metrics, and iterative refinement makes generated images more faithful to complex prompts.

Contribution

It proposes a novel assertion-based alignment score and an iterative refinement process to improve text-to-image generation quality.

Findings

1. The proposed alignment score correlates better with human ratings than CLIP-based scores.
2. Iterative refinement improves the alignment of generated images with complex prompts.
3. Text-to-image alignment accuracy improves by 8.7% over previous methods.
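
The iterative refinement named in the findings can be pictured as a feedback loop: score each assertion, find the weakest one, emphasise it, and regenerate. The sketch below is only an illustration under stated assumptions, not the authors' implementation. The names `refine`, `generate`, `score`, `threshold`, `step`, and `max_iters` are hypothetical, and the per-assertion weights are an assumed stand-in for whatever mechanism the generator uses to emphasise part of a prompt.

```python
from typing import Callable, Dict, List, Tuple


def refine(
    assertions: List[str],
    generate: Callable[[Dict[str, float]], object],
    score: Callable[[object, str], float],
    threshold: float = 0.8,
    step: float = 0.5,
    max_iters: int = 5,
) -> Tuple[object, Dict[str, float]]:
    """Regenerate until the weakest assertion scores at least `threshold`.

    `generate` and `score` are hypothetical callables standing in for an
    image generator and a VQA-based assertion scorer.
    """
    weights = {a: 1.0 for a in assertions}
    image = generate(weights)
    for _ in range(max_iters):
        scores = {a: score(image, a) for a in assertions}
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= threshold:
            break  # every assertion is aligned well enough
        weights[weakest] += step  # emphasise the worst-aligned assertion
        image = generate(weights)
    return image, weights


# Toy stand-ins: the "image" is just a copy of the weights, and an
# assertion's score grows with its weight up to a cap of 1.0.
base = {"a red cube": 0.5, "a blue ball": 0.9}
toy_generate = lambda w: dict(w)
toy_score = lambda image, a: min(1.0, base[a] * image[a])

final_image, final_weights = refine(list(base), toy_generate, toy_score)
```

In this toy run only the poorly aligned assertion gains weight; the well-aligned one is left alone, which is the point of targeting feedback per assertion rather than per prompt.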

Abstract

The field of text-conditioned image generation has made unparalleled progress with the recent advent of latent diffusion models. While remarkable, as the complexity of given text input increases, the state-of-the-art diffusion models may still fail in generating images which accurately convey the semantics of the given prompt. Furthermore, it has been observed that such misalignments are often left undetected by pretrained multi-modal models such as CLIP. To address these problems, in this paper we explore a simple yet effective decompositional approach towards both evaluation and improvement of text-to-image alignment. In particular, we first introduce a Decompositional-Alignment-Score which given a complex prompt decomposes it into a set of disjoint assertions. The alignment of each assertion with generated images is then measured using a VQA model. Finally, alignment scores for…
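
The scoring idea in the abstract (split a prompt into assertions, check each one against the image with a VQA model, then combine) can be sketched as follows. This is a minimal illustration under assumptions: the abstract is cut off before it says how the per-assertion scores are combined, so the mean used here is a placeholder and not the paper's formula. The function names, the `vqa_yes_probability` callable, and the toy data are all hypothetical.

```python
from typing import Any, Callable, List, Sequence, Tuple


def decompositional_alignment_score(
    image: Any,
    assertions: Sequence[str],
    vqa_yes_probability: Callable[[Any, str], float],
) -> Tuple[List[float], float]:
    """Score each assertion separately, then aggregate.

    `vqa_yes_probability` is a hypothetical callable returning how strongly
    a VQA model believes the assertion holds for the image (0.0 to 1.0).
    The mean is an ASSUMED aggregation, used for illustration only.
    """
    if not assertions:
        raise ValueError("need at least one assertion")
    per_assertion = [vqa_yes_probability(image, a) for a in assertions]
    overall = sum(per_assertion) / len(per_assertion)
    return per_assertion, overall


# Toy stand-in for a VQA model: the "image" is a dict of fixed scores.
def toy_vqa(image: Any, assertion: str) -> float:
    return image[assertion]


toy_image = {"there is a cat": 0.9, "the cat is on a sofa": 0.3}
scores, overall = decompositional_alignment_score(
    toy_image, list(toy_image), toy_vqa
)
```

Keeping the per-assertion scores alongside the overall value matters: a single prompt-level number would hide which part of the prompt the image fails to convey.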

Videos

Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Computational and Text Analysis Methods · Domain Adaptation and Few-Shot Learning

Methods: Diffusion · BLIP: Bootstrapping Language-Image Pre-training · Contrastive Language-Image Pre-training