Words Worth a Thousand Pictures: Measuring and Understanding Perceptual Variability in Text-to-Image Generation

Raphael Tang; Xinyu Zhang; Lixinyu Xu; Yao Lu; Wenyan Li; Pontus Stenetorp; Jimmy Lin; Ferhan Ture

arXiv:2406.08482·cs.CV·November 27, 2024



TL;DR

This paper introduces W1KP, a human-calibrated perceptual variability measure for diffusion-based text-to-image models, revealing how prompts influence image diversity and reusability, and analyzing linguistic factors affecting variability.

Contribution

We propose W1KP, a novel perceptual variability metric calibrated with human judgments, and provide the first analysis of diffusion model variability from a visuolinguistic perspective.

Findings

01

The best W1KP perceptual distance outperforms nine baselines by up to 18 points in accuracy.

02

The calibration matches graded human judgments 78% of the time.

03

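Prompt reusability varies across models: Imagen prompts can be reused for 10-50 random seeds before new images become too similar to earlier ones, while Stable Diffusion XL and DALL-E 3 prompts can be reused 50-200 times.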

Abstract

Diffusion models are the state of the art in text-to-image generation, but their perceptual variability remains understudied. In this paper, we examine how prompts affect image variability in black-box diffusion-based models. We propose W1KP, a human-calibrated measure of variability in a set of images, bootstrapped from existing image-pair perceptual distances. Current datasets do not cover recent diffusion models, thus we curate three test sets for evaluation. Our best perceptual distance outperforms nine baselines by up to 18 points in accuracy, and our calibration matches graded human judgements 78% of the time. Using W1KP, we study prompt reusability and show that Imagen prompts can be reused for 10-50 random seeds before new images become too similar to already generated images, while Stable Diffusion XL and DALL-E 3 can be reused 50-200 times. Lastly, we analyze 56 linguistic…
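The abstract describes W1KP as a set-level variability measure bootstrapped from image-pair perceptual distances. As a rough illustration of that idea only, the sketch below averages a pairwise distance over all unordered pairs in a set. The mean aggregation, the names `set_variability` and `toy_distance`, and the use of plain feature vectors in place of images are all assumptions made for this example; the excerpt does not state the paper's actual aggregation, backbone distance, or calibration step.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def set_variability(items: Sequence[T], distance: Callable[[T, T], float]) -> float:
    """Mean pairwise distance over all unordered pairs in `items`.

    Assumption: averaging is used here for illustration; the paper's
    own aggregation may differ.
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items to measure variability")
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def toy_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for a perceptual distance:
    mean absolute difference between equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Three toy "images" represented as 2-dimensional feature vectors.
images = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
score = set_variability(images, toy_distance)
```

In practice, `toy_distance` would be replaced by a learned image-pair perceptual distance, and the raw score would still need to be mapped to human-interpretable levels, which this sketch does not attempt.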



Taxonomy

Topics: Computer Graphics and Visualization Techniques

Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training · Diffusion