Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Xingyu Fu; Muyu He; Yujie Lu; William Yang Wang; Dan Roth

arXiv:2406.07546·cs.CV·August 14, 2024·1 cites

TL;DR

This paper introduces Commonsense-T2I, a benchmark to evaluate text-to-image models' ability to generate images consistent with real-world commonsense, revealing significant gaps even in state-of-the-art models.

Contribution

It presents a new adversarial benchmark dataset for assessing commonsense reasoning in T2I models and provides a comprehensive evaluation of current models' performance.

Findings

01 State-of-the-art models achieve less than 50% accuracy

02 GPT-enriched prompts do not significantly improve results

03 There is a large gap between generated images and real photos

Abstract

We present a novel task and benchmark for evaluating the ability of text-to-image (T2I) generation models to produce images that align with commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" vs. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g., produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit", respectively. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist in analyzing model behavior. We benchmark a variety of state-of-the-art…
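The pairwise setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes, as a labeled simplification, that a prompt pair counts as correct only when both generated images fit their expected outputs, and that some judge (a human annotator or an automatic model) decides whether an image fits a description. The names PromptPair, pairwise_accuracy, and judge are hypothetical, and images are stood in for by plain strings.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class PromptPair:
    """Two near-identical prompts plus the expected output for each (hypothetical schema)."""
    prompt_1: str
    expected_1: str
    prompt_2: str
    expected_2: str


def pairwise_accuracy(
    pairs: Sequence[PromptPair],
    images: Mapping[str, str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of pairs where BOTH images fit their expected output.

    Assumption: a pair is scored as correct only if both sides are correct.
    `images` maps each prompt to its generated image (a string stand-in here);
    `judge(image, expected)` returns True when the image fits the description.
    """
    if not pairs:
        raise ValueError("need at least one prompt pair")
    correct = 0
    for pair in pairs:
        ok_1 = judge(images[pair.prompt_1], pair.expected_1)
        ok_2 = judge(images[pair.prompt_2], pair.expected_2)
        if ok_1 and ok_2:
            correct += 1
    return correct / len(pairs)
```

Under this assumed scoring rule, a model that draws a lit bulb for both lightbulb prompts gets no credit for the pair, even though one of its two images is right. That is what makes the paired prompts adversarial: a model cannot score by ignoring the small difference between them.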

Code & Models

Datasets

CommonsenseT2I/CommonsensenT2I
dataset · 238 downloads

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Mathematics, Computing, and Information Processing

Methods: Sparse Evolutionary Training · ALIGN · Diffusion