SANEval: Open-Vocabulary Compositional Benchmarks with Failure-mode Diagnosis

Rishav Pramanik; Ian E. Nielsen; Jeff Smith; Saurav Pandit; Ravi P. Ramachandran; Zhaozheng Yin

arXiv:2602.00249 · cs.CV · February 3, 2026

Open Access · 3 Reviews

TL;DR

SANEval introduces a scalable, open-vocabulary benchmark for evaluating complex compositional prompts in text-to-image models, combining deep prompt understanding with robust object detection to improve diagnostic capabilities.

Contribution

The paper presents SANEval, a novel benchmark that enables fine-grained, open-vocabulary evaluation of T2I models' compositional understanding using LLM-enhanced detection and automated diagnostics.

Findings

01. SANEval's evaluations correlate better with human judgments than existing benchmarks.

02. The benchmark effectively diagnoses failures in attribute binding, spatial relations, and numeracy.

03. Experiments on six state-of-the-art models demonstrate its robustness and diagnostic power.

Abstract

The rapid progress of text-to-image (T2I) models has unlocked unprecedented creative potential, yet their ability to faithfully render complex prompts involving multiple objects, attributes, and spatial relationships remains a significant bottleneck. Progress is hampered by a lack of adequate evaluation methods; current benchmarks are often restricted to closed-set vocabularies, lack fine-grained diagnostic capabilities, and fail to provide the interpretable feedback necessary to diagnose and remedy specific compositional failures. We solve these challenges by introducing SANEval (Spatial, Attribute, and Numeracy Evaluation), a comprehensive benchmark that establishes a scalable new pipeline for open-vocabulary compositional evaluation. SANEval combines a large language model (LLM) for deep prompt understanding with an LLM-enhanced, open-vocabulary object detector to robustly evaluate…

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

* The paper introduces a well-structured benchmark that separately evaluates spatial reasoning, attribute binding, and numeracy, offering a more interpretable breakdown of compositional performance than prior holistic metrics.
* By integrating LLM-based synonym expansion with an open-world detector (YOLO-E), SANEval effectively overcomes the fixed-class limitations of existing object-detection-based benchmarks.
* The framework provides structured, human-readable feedback that identifies missin
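The synonym expansion mentioned in the second point can be illustrated with a small matching routine. This is a minimal sketch, not the paper's pipeline: the synonym sets are written by hand here (in SANEval they would come from an LLM), the detector output is a plain list of label strings, and the names `normalize` and `match_detections` are hypothetical.

```python
from typing import Dict, List, Set


def normalize(label: str) -> str:
    """Lower-case a label and collapse whitespace so comparisons are stable."""
    return " ".join(label.lower().split())


def match_detections(required: Dict[str, Set[str]], detected: List[str]) -> Dict[str, int]:
    """Count detections per prompt object, accepting the object name or any synonym."""
    # Precompute the accepted label set for each prompt object.
    accepted = {
        name: {normalize(name)} | {normalize(s) for s in synonyms}
        for name, synonyms in required.items()
    }
    counts = {name: 0 for name in required}
    for label in detected:
        key = normalize(label)
        for name, labels in accepted.items():
            if key in labels:
                counts[name] += 1
                break  # each detection is credited to at most one prompt object
    return counts


# Assumed inputs: hand-written synonyms and made-up detector labels.
synonyms = {"couch": {"sofa", "settee"}, "tv": {"television", "monitor"}}
detected = ["Sofa", "television", "lamp", "sofa"]
counts = match_detections(synonyms, detected)
```

With a fixed-class detector, a prompt asking for a "couch" would fail whenever the detector's vocabulary only contains "sofa"; accepting a synonym set per prompt object avoids that mismatch.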

Weaknesses

- The benchmark relies heavily on proprietary LLMs (e.g., Gemini-2.5-Flash) for both prompt parsing and evaluation, which may limit reproducibility.
- The qualitative feedback examples remain limited, and it is unclear how consistently the diagnostic outputs generalize across diverse prompt domains.
- The benchmark's prompts are synthetically constructed and may not reflect real-world user prompts, particularly their diversity.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

Strong diagnostic capability: SANEval goes beyond providing a single score; it outputs structured, interpretable feedback, explicitly identifying missing objects, incorrect attribute bindings, and count mismatches. This makes it highly useful for debugging and improving T2I systems.

Open-source commitment: The authors plan to release the dataset, prompts, annotations, and full evaluation pipeline, which will greatly facilitate reproducibility and help standardize compositional evaluation in the

Weaknesses

Limited robustness analysis: The paper does not thoroughly examine how LLM parsing errors or object detection failures (e.g., hallucinations or missed detections) propagate through the pipeline and affect final scoring reliability.

High computational cost: The evaluation pipeline requires multiple rounds of LLM calls and YOLO-E inference per image, which may make it expensive and impractical for large-scale evaluation on millions of samples.

Insufficient prompt diversity: The dataset’s ~5000 p

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Problem importance. The paper focuses on a real bottleneck in current T2I systems: controllability. Capturing whether a model got “two red cars to the left of a blue bus” right is directly relevant to downstream productization and safety of generative vision systems. Framing spatial relations, numeracy, and attribute binding as three core controllability axes is well-motivated.
2. Pipeline design / interpretability. The evaluation stack is modular and conceptually clean: prompt parsing → sy
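The example prompt in point 1, “two red cars to the left of a blue bus”, can be checked from detector output roughly as follows. This is a minimal sketch rather than the paper's scoring code: the `Box` type, the `diagnose` function, the centre-x comparison rule, and the sample boxes are all assumptions made for illustration, and attribute binding (red, blue) is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical detection: a label plus pixel corners (origin at top-left)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2.0


def diagnose(boxes: List[Box], subject: str, expected: int, reference: str) -> List[str]:
    """Return readable failures for '<expected> <subject> to the left of a <reference>'."""
    failures: List[str] = []
    subjects = [b for b in boxes if b.label == subject]
    refs = [b for b in boxes if b.label == reference]

    # Numeracy: the detected count must equal the count stated in the prompt.
    if len(subjects) != expected:
        failures.append(
            f"count mismatch: expected {expected} {subject}, found {len(subjects)}"
        )
    # A missing reference object makes the spatial relation uncheckable.
    if not refs:
        failures.append(f"missing object: {reference}")
    # Spatial (assumed rule): every subject centre must lie left of every reference centre.
    elif any(s.center_x >= r.center_x for s in subjects for r in refs):
        failures.append(f"spatial error: a {subject} is not left of the {reference}")
    return failures


# Made-up detections for three generated images.
bus = Box("bus", 200, 30, 400, 120)
good = [Box("car", 10, 50, 60, 90), Box("car", 70, 50, 120, 90), bus]
misplaced = [Box("car", 10, 50, 60, 90), Box("car", 420, 50, 470, 90), bus]
too_few = [Box("car", 10, 50, 60, 90), bus]

good_result = diagnose(good, "car", 2, "bus")
misplaced_result = diagnose(misplaced, "car", 2, "bus")
too_few_result = diagnose(too_few, "car", 2, "bus")
```

Returning a list of named failures instead of a single pass/fail flag mirrors the structured feedback the reviewers describe, where count and spatial errors are reported separately.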

Weaknesses

1. Reproducibility and stability are underdeveloped. The benchmark depends on proprietary or partially described components (e.g. Gemini-2.5-Flash for prompt parsing and attribute judgment, YOLO-E for open-vocabulary detection), some of which are not publicly reproducible. The paper promises release of data and code but does not convincingly demonstrate that the community will be able to run the full pipeline without access to closed-source commercial systems.
2. Limited validation of metric co

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications