Evaluating Text-to-Visual Generation with Image-to-Text Generation

Zhiqiu Lin; Deepak Pathak; Baiqi Li; Jiayao Li; Xide Xia; Graham Neubig; Pengchuan Zhang; Deva Ramanan

arXiv:2404.01291 · cs.CV · June 19, 2024 · 3 cites

TL;DR

This paper introduces VQAScore, a new evaluation metric for text-to-visual generation that uses visual-question-answering models to better assess complex image-text alignment, outperforming existing metrics.

Contribution

The paper proposes VQAScore, a novel VQA-based metric for evaluating image-text alignment, and introduces GenAI-Bench, a challenging benchmark with human ratings for diverse generative models.

Findings

01. VQAScore achieves state-of-the-art results across 8 benchmarks.

02. CLIP-FlanT5 outperforms GPT-4V-based baselines.

03. VQAScore can align text with video and 3D models.

Abstract

Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art…
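The scoring rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`build_question`, `yes_probability`, `same_bag_of_words`) and the toy logits are assumptions made for this example, and a real VQA model would supply the answer-token logits after seeing the image and the question. The first helper also shows why a pure bag-of-words view cannot separate the two horse/grass prompts.

```python
import math
from collections import Counter
from typing import Dict

# Question template quoted in the abstract.
QUESTION_TEMPLATE = "Does this figure show '{text}'?"


def build_question(text: str) -> str:
    """Wrap a text prompt in the yes/no question from the abstract."""
    return QUESTION_TEMPLATE.format(text=text)


def yes_probability(answer_logits: Dict[str, float]) -> float:
    """Softmax over candidate answer tokens, returning P("Yes").

    `answer_logits` is a hypothetical stand-in for what a VQA model
    would output for the first answer token.
    """
    max_logit = max(answer_logits.values())  # subtract max for stability
    exps = {tok: math.exp(v - max_logit) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


def same_bag_of_words(a: str, b: str) -> bool:
    """True when two prompts contain the same words, ignoring order."""
    return Counter(a.lower().split()) == Counter(b.lower().split())


if __name__ == "__main__":
    prompt = "the horse is eating the grass"
    swapped = "the grass is eating the horse"
    print(build_question(prompt))
    # An order-insensitive view cannot tell the two prompts apart.
    print(same_bag_of_words(prompt, swapped))
    # Toy logits, assumed for illustration only.
    print(yes_probability({"Yes": 2.0, "No": 0.0}))
```

In this sketch the alignment score is simply the value returned by `yes_probability`; a higher value means the (hypothetical) model considers the image a better match for the prompt.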

Taxonomy

Topics: Digital Humanities and Scholarship

Methods: Diffusion · ALIGN · Contrastive Language-Image Pre-training