GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

Baiqi Li; Zhiqiu Lin; Deepak Pathak; Jiayao Li; Yixin Fei; Kewen Wu; Tiffany Ling; Xide Xia; Pengchuan Zhang; Graham Neubig; Deva Ramanan

arXiv:2406.13743 · cs.CV · November 5, 2024

Open Access · 2 Repos · 1 Dataset

TL;DR

This paper evaluates the performance of text-to-visual models on compositional prompts, introduces VQAScore as an effective evaluation metric, and releases a new benchmark with extensive human ratings to advance the field.

Contribution

It presents a comprehensive human study on compositional text-to-visual generation, introduces VQAScore for improved evaluation, and releases a large benchmark dataset for future research.

Findings

01

VQAScore outperforms previous metrics like CLIPScore.

02

Ranking with VQAScore significantly improves human alignment.

03

The new GenAI-Rank benchmark contains over 40,000 human ratings.

Abstract

While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring…
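The black-box improvement described in the abstract (generate a few candidates, then keep the one the metric scores highest) can be sketched as a simple best-of-n selection. This is a minimal illustration, not the authors' implementation: `rank_by_score`, `pick_best`, and `toy_score` are hypothetical names, and `toy_score` is a stand-in word-overlap scorer used only so the example runs without a model. In practice the scoring function would be an image-text alignment metric such as VQAScore.

```python
from typing import Any, Callable, List, Sequence, Tuple

# A scorer maps (image, prompt) to a number; higher means better alignment.
# Assumption: any alignment metric can be plugged in here.
ScoreFn = Callable[[Any, str], float]


def rank_by_score(
    prompt: str, candidates: Sequence[Any], score_fn: ScoreFn
) -> List[Tuple[float, Any]]:
    """Return (score, candidate) pairs sorted from best to worst."""
    scored = [(score_fn(image, prompt), image) for image in candidates]
    # Sort on the score only, so candidates never need to be comparable.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def pick_best(prompt: str, candidates: Sequence[Any], score_fn: ScoreFn) -> Any:
    """Select the highest-scoring candidate (best-of-n selection)."""
    if not candidates:
        raise ValueError("need at least one candidate image")
    return rank_by_score(prompt, candidates, score_fn)[0][1]


def toy_score(image: dict, prompt: str) -> float:
    """Hypothetical stand-in scorer: fraction of prompt words found in the
    candidate's tag set. A real setup would call an alignment model here."""
    words = prompt.lower().split()
    return sum(word in image["tags"] for word in words) / len(words)


if __name__ == "__main__":
    # Three fake "images" described by tags, standing in for generated samples.
    candidates = [
        {"id": "a", "tags": {"dog"}},
        {"id": "b", "tags": {"red", "ball", "dog"}},
        {"id": "c", "tags": {"ball"}},
    ]
    best = pick_best("red ball dog", candidates, toy_score)
    print(best["id"])
```

Because the generator is never modified, the same selection loop applies to any text-to-visual model that can produce several samples per prompt; only the scoring function changes.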

Code & Models

Repositories

Datasets

BaiqiL/GenAI-Bench
dataset · 420 dl

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Digital Storytelling and Education

Methods: Diffusion