Evaluating Text-to-Visual Generation with Image-to-Text Generation
Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan

TL;DR
This paper introduces VQAScore, a new evaluation metric for text-to-visual generation that uses visual-question-answering models to better assess complex image-text alignment, outperforming existing metrics.
Contribution
The paper proposes VQAScore, a novel VQA-based metric for evaluating image-text alignment, and introduces GenAI-Bench, a challenging benchmark with human ratings for diverse generative models.
Findings
VQAScore achieves state-of-the-art results across 8 image-text alignment benchmarks.
CLIP-FlanT5 outperforms GPT-4V-based baselines.
VQAScore can align text with video and 3D models.
Abstract
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art…
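The scoring rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `build_question` and `yes_probability` are hypothetical, the logits are made-up numbers standing in for a VQA model's output, and the softmax is taken over a two-token candidate set, whereas a real model would normalize over its full vocabulary.

```python
import math


def build_question(text):
    # Question wording taken from the abstract; the helper itself is hypothetical.
    return "Does this figure show '{}'?".format(text)


def yes_probability(answer_logits):
    # Softmax over the candidate answer tokens, returning P("Yes").
    # Assumption: answer_logits maps each candidate token to the logit a
    # VQA model assigns it as the first answer token.
    peak = max(answer_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in answer_logits.items()}
    return exps["Yes"] / sum(exps.values())


# Made-up logits for an image of a horse grazing, scored against two prompts.
question_a = build_question("the horse is eating the grass")
question_b = build_question("the grass is eating the horse")
score_a = yes_probability({"Yes": 2.0, "No": 0.0})  # prompt matches the image
score_b = yes_probability({"Yes": 0.0, "No": 2.0})  # word order reversed

print(question_a, round(score_a, 3))
print(question_b, round(score_b, 3))
```

Because the score is a probability conditioned on both the image and the full question, the two word orders can receive different scores, which is the behavior a "bag of words" text encoder fails to provide.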
Taxonomy
Topics: Digital Humanities and Scholarship
Methods: Diffusion · ALIGN · Contrastive Language-Image Pre-training
