GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
Baiqi Li, Zhiqiu Lin, Deepak Pathak, Jiayao Li, Yixin Fei, Kewen Wu, Tiffany Ling, Xide Xia, Pengchuan Zhang, Graham Neubig, Deva Ramanan

TL;DR
This paper evaluates the performance of text-to-visual models on compositional prompts, introduces VQAScore as an effective evaluation metric, and releases a new benchmark with extensive human ratings to advance the field.
Contribution
It presents a comprehensive human study on compositional text-to-visual generation, introduces VQAScore for improved evaluation, and releases a large benchmark dataset for future research.
Findings
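VQAScore significantly outperforms previous metrics such as CLIPScore when compared against the collected human ratings.
Ranking a few (3 to 9) candidate images by VQAScore improves human alignment in a black-box manner, without finetuning the generator.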
The new GenAI-Rank benchmark contains over 40,000 human ratings.
Abstract
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring…
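The black-box ranking step described in the abstract can be sketched as a best-of-n selection: generate a few candidates, score each against the prompt, and keep the highest-scoring one. The sketch below is only an illustration under stated assumptions. The names `rank_candidates`, `pick_best`, and `toy_score` are hypothetical and do not come from the paper, and `toy_score` is a trivial stand-in that counts matching tags; it is not VQAScore. In practice `score_fn` would wrap a VQA model that returns the likelihood that the image depicts the prompt.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

ImageT = TypeVar("ImageT")  # any image representation the scorer accepts


def rank_candidates(
    prompt: str,
    candidates: Sequence[ImageT],
    score_fn: Callable[[ImageT, str], float],
) -> List[Tuple[ImageT, float]]:
    """Return (candidate, score) pairs sorted from highest to lowest score.

    Ties keep the original candidate order.
    """
    scored = [(score_fn(image, prompt), index) for index, image in enumerate(candidates)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(candidates[index], score) for score, index in scored]


def pick_best(
    prompt: str,
    candidates: Sequence[ImageT],
    score_fn: Callable[[ImageT, str], float],
) -> ImageT:
    """Select the single candidate with the highest score (best-of-n)."""
    if not candidates:
        raise ValueError("need at least one candidate image")
    return rank_candidates(prompt, candidates, score_fn)[0][0]


def toy_score(image_tags: frozenset, prompt: str) -> float:
    """Hypothetical stand-in scorer, NOT VQAScore.

    Treats an "image" as a set of tags and returns the fraction of prompt
    words that appear among those tags.
    """
    words = prompt.lower().split()
    return sum(word in image_tags for word in words) / len(words)


# Three hypothetical candidates for the same prompt, as tag sets.
candidates = [
    frozenset({"dog"}),
    frozenset({"red", "dog"}),
    frozenset({"a", "red", "dog", "on", "sofa"}),
]
best = pick_best("a red dog on a sofa", candidates, toy_score)
print(sorted(best))
```

Because the generator is never modified, the same selection logic applies to any model that can produce several samples per prompt; only the scoring function changes.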
Taxonomy
Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Digital Storytelling and Education
Methods: Diffusion
