Relations, Negations, and Numbers: Looking for Logic in Generative Text-to-Image Models

Colin Conwell, Rupert Tawiah-Quashie, and Tomer Ullman

arXiv:2411.17066 · cs.CV · November 27, 2024

TL;DR

This paper tests whether state-of-the-art text-to-image models such as DALL-E 3 can reliably render prompts built on logical operators (relations, negations, and numbers), and finds significant gaps in compositional reasoning.

Contribution

The paper evaluates, through human ratings of generated images, how modern generative models fail on logical operators, and proposes minimal modifications to improve their compositional capabilities.

Findings

01 DALL-E 3 fails most often on negation prompts and on numbers greater than 3.

02 A "grounded diffusion" pipeline is judged worse than DALL-E 3 across prompts.

03 The frequency of a relational prompt correlates with how well the generated images match it.

Abstract

Despite remarkable progress in multi-modal AI research, there is a salient domain in which modern AI continues to lag considerably behind even human children: the reliable deployment of logical operators. Here, we examine three forms of logical operators: relations, negations, and discrete numbers. We asked human respondents (N=178 in total) to evaluate images generated by a state-of-the-art image-generating AI (DALL-E 3) prompted with these 'logical probes', and find that none reliably produce human agreement scores greater than 50%. The negation probes and numbers (beyond 3) fail most frequently. In a 4th experiment, we assess a 'grounded diffusion' pipeline that leverages targeted prompt engineering and structured intermediate representations for greater compositional control, but find its performance is judged even worse than that of DALL-E 3 across prompts. To provide further…
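
As a rough illustration of the kind of measurement the abstract refers to, the sketch below computes a per-prompt agreement score and checks it against the 50% mark. It is not the authors' analysis code. It assumes that an agreement score is the fraction of human ratings that judge an image to match its prompt. The function names (`agreement_scores`, `reliable_prompts`), the example prompts, and every rating value are hypothetical and invented purely for illustration.

```python
from collections import defaultdict


def agreement_scores(ratings):
    """Return {prompt: fraction of ratings that judged the image a match}.

    `ratings` is an iterable of (prompt, matched) pairs, where `matched`
    is True when a respondent judged the image to match the prompt.
    ASSUMPTION: this definition of "agreement score" is illustrative only.
    """
    counts = defaultdict(lambda: [0, 0])  # prompt -> [matches, total]
    for prompt, matched in ratings:
        counts[prompt][0] += int(matched)
        counts[prompt][1] += 1
    return {p: m / t for p, (m, t) in counts.items()}


def reliable_prompts(scores, threshold=0.5):
    """Return the prompts whose agreement score strictly exceeds `threshold`."""
    return sorted(p for p, s in scores.items() if s > threshold)


# Hypothetical toy ratings (NOT data from the paper).
toy_ratings = [
    ("a cup on a spoon", True),
    ("a cup on a spoon", False),
    ("a cup on a spoon", False),
    ("a cup on a spoon", False),
    ("a dog without a tail", False),
    ("a dog without a tail", False),
    ("two apples", True),
    ("two apples", True),
    ("two apples", False),
]

scores = agreement_scores(toy_ratings)
for prompt, score in sorted(scores.items()):
    print(f"{prompt}: {score:.2f}")
print("above 50%:", reliable_prompts(scores))
```

Using a strict inequality against the threshold mirrors the abstract's wording ("greater than 50%"). A real analysis would also need to account for variation across respondents and across the multiple images generated per prompt.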

Code & Models

Repositories

colinconwell/t2i-probology (Official)

Taxonomy

Topics: Digital Humanities and Scholarship