VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
Jingtao Cao, Zheng Zhang, Hongru Wang, Kam-Fai Wong

TL;DR
VLEU is a novel evaluation metric for Text-to-Image models that measures their ability to handle diverse prompts by analyzing the distribution of generated images relative to input texts using large language models and CLIP.
Contribution
We introduce VLEU, a new metric leveraging large language models and CLIP to assess the generalizability of T2I models across diverse prompts, filling a gap in existing evaluation methods.
Findings
VLEU effectively measures T2I model generalization.
VLEU correlates well with model finetuning improvements.
VLEU distinguishes different T2I models based on prompt diversity.
Abstract
Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric…
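The divergence-based score described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function name `vleu_like_score` is hypothetical, the CLIP image-text similarities are assumed to be precomputed as a matrix, the conditional distribution is assumed to be a temperature-scaled softmax over prompts, the marginal is assumed to be the average of the conditionals, and the final exponentiation follows the style of the Inception Score rather than any confirmed detail of VLEU.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def vleu_like_score(similarity, temperature=0.01):
    """Hypothetical VLEU-style score from an image-by-prompt similarity matrix.

    similarity[i, j] is assumed to be the CLIP cosine similarity between
    generated image i and sampled prompt j. The temperature value is an
    illustrative choice, not taken from the paper.
    """
    sim = np.asarray(similarity, dtype=np.float64)
    # Assumed conditional distribution over prompts for each image.
    cond = softmax(sim / temperature, axis=1)
    # Assumed marginal distribution over prompts: average of the conditionals.
    marginal = cond.mean(axis=0, keepdims=True)
    eps = 1e-12  # guards against log(0)
    # KL(conditional || marginal) per image, then averaged over images.
    kl = (cond * (np.log(cond + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))


if __name__ == "__main__":
    # Each image matches exactly one of four prompts: score approaches 4.
    print(vleu_like_score(np.eye(4)))
    # Images are equally similar to every prompt: score is 1.
    print(vleu_like_score(np.ones((4, 4))))
```

Under these assumptions the score ranges from 1, when generated images do not discriminate between prompts, up to the number of prompts, when each image aligns with exactly one distinct prompt.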
Taxonomy
Topics: Mathematics, Computing, and Information Processing
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
