FineGRAIN: Evaluating Failure Modes of Text-to-Image Models with Vision Language Model Judges

Kevin David Hayes; Micah Goldblum; Vikash Sehwag; Gowthami Somepalli; Ashwinee Panda; Tom Goldstein

arXiv:2512.02161 · cs.CV · December 3, 2025

Open Access

TL;DR

This paper introduces a hierarchical evaluation framework and dataset for assessing failure modes in text-to-image models using vision language model judges, revealing systematic errors and limitations of current metrics.

Contribution

It proposes a structured methodology and dataset for jointly evaluating T2I models and VLMs on specific failure modes, advancing model interpretability and reliability assessment.

Findings

01. VLMs can identify 27 specific failure modes in generated images.

02. Current metrics do not fully capture nuanced errors in attribute fidelity.

03. Systematic errors in object representation and attribute accuracy are prevalent.

Abstract

Text-to-image (T2I) models are capable of generating visually impressive images, yet they often fail to accurately capture specific attributes in user prompts, such as the correct number of objects with the specified colors. The diversity of such errors underscores the need for a hierarchical evaluation framework that can compare prompt adherence abilities of different image generation models. Simultaneously, benchmarks of vision language models (VLMs) have not kept pace with the complexity of scenes that VLMs are used to annotate. In this work, we propose a structured methodology for jointly evaluating T2I models and VLMs by testing whether VLMs can identify 27 specific failure modes in the images generated by T2I models conditioned on challenging prompts. Our second contribution is a dataset of prompts and images generated by 5 T2I models (Flux, SD3-Medium, SD3-Large, SD3.5-Medium,…
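As a rough illustration of the joint evaluation described in the abstract, the sketch below tallies two quantities from a list of judge verdicts: how often each T2I model is flagged for each failure mode, and how often the VLM judge agrees with a reference label. It is not the authors' code. The `Verdict` record, the reference label `reference_says_failure`, the function names, the placeholder model names, the failure-mode names, and the toy data are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    """One judged image. All field names are hypothetical."""
    model: str                    # T2I model that generated the image
    failure_mode: str             # failure mode the prompt was written to probe
    judge_says_failure: bool      # boolean verdict from the VLM judge
    reference_says_failure: bool  # assumed reference label (e.g. a human check)


def failure_rates(verdicts):
    """Fraction of images flagged by the judge, per (model, failure_mode)."""
    counts = defaultdict(lambda: [0, 0])  # [flagged, total]
    for v in verdicts:
        c = counts[(v.model, v.failure_mode)]
        c[0] += v.judge_says_failure
        c[1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}


def judge_agreement(verdicts):
    """Fraction of images where judge and reference agree, per failure mode."""
    counts = defaultdict(lambda: [0, 0])  # [agreements, total]
    for v in verdicts:
        c = counts[v.failure_mode]
        c[0] += v.judge_says_failure == v.reference_says_failure
        c[1] += 1
    return {mode: agree / total for mode, (agree, total) in counts.items()}


# Made-up toy data, not results from the paper.
toy_verdicts = [
    Verdict("model_a", "counting", True, True),
    Verdict("model_a", "counting", False, True),
    Verdict("model_b", "counting", False, False),
    Verdict("model_b", "color", True, False),
]

rates = failure_rates(toy_verdicts)
agreement = judge_agreement(toy_verdicts)
```

Keeping the two tallies separate mirrors the two-sided framing in the abstract: `failure_rates` compares T2I models under a fixed judge, while `judge_agreement` scores the judge itself against whatever reference labels are available.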

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Data Visualization and Analytics