DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

Jaemin Cho; Abhay Zala; Mohit Bansal

arXiv:2202.04053 · cs.CV · September 1, 2023 · 24 citations

TL;DR

This paper evaluates the reasoning abilities and social biases of state-of-the-art text-to-image models, revealing significant gaps in visual reasoning skills and the presence of biases learned from web data.

Contribution

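The paper introduces PaintSkills, a compositional diagnostic dataset for assessing three visual reasoning skills (object recognition, object counting, and spatial relation understanding), and analyzes the social biases of recent text-to-image models.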

Findings

1. Models perform poorly on object counting and spatial reasoning tasks.
2. Recent models exhibit gender and skin tone biases learned from web data.
3. The study highlights the need for improved reasoning and bias mitigation in text-to-image models.

Abstract

Recently, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial…
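The three skills named in the abstract can be scored from the output of an object detector run on each generated image. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: it assumes a detector has already produced class labels and bounding boxes, and the function names, the box format `(x_min, y_min, x_max, y_max)`, and the center-based spatial rule are all hypothetical choices made for illustration.

```python
from collections import Counter


def object_correct(detected_labels, target):
    """Object recognition: the target class appears at least once."""
    return target in detected_labels


def count_correct(detected_labels, target, expected_count):
    """Object counting: the target class appears exactly expected_count times."""
    return Counter(detected_labels)[target] == expected_count


def spatial_correct(box_a, box_b, relation):
    """Spatial relation (assumed rule): compare box centers.

    Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    with y growing downward, so "above" means a smaller y value.
    """
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    checks = {
        "left": ax < bx,
        "right": ax > bx,
        "above": ay < by,
        "below": ay > by,
    }
    return checks[relation]


def accuracy(results):
    """Fraction of True values; 0.0 for an empty list."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0
```

Per-skill accuracy would then be the mean of these boolean checks over all prompts for that skill, for example `accuracy(count_correct(labels, "dog", 2) for labels in all_detections)`, where `all_detections` is a hypothetical list of per-image label lists.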

Code & Models

Models

j-min/PaintSkills-DETR-R101-DC5 (model)

Datasets

j-min/PaintSkills (dataset, 543 downloads)

Videos

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models (YouTube)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Diffusion