VISTAR: A User-Centric and Role-Driven Benchmark for Text-to-Image Evaluation

Kaiyuan Jiang; Ruoxi Sun; Ying Cao; Yuqi Xu; Xinran Zhang; Junyan Guo; ChengSheng Deng

arXiv:2508.06152·cs.CV·August 11, 2025

Open Access

TL;DR

VISTAR is a comprehensive, user-centric benchmark for text-to-image evaluation combining physical attribute metrics and semantic assessment via vision-language models, validated through extensive expert input and human comparisons.

Contribution

The paper introduces a two-tier evaluation scheme that pairs deterministic metrics with a hierarchical questioning approach (HWPQ), and defines a multi-dimensional benchmark grounded in expert consensus and human validation.

Findings

01 The metrics achieve over 75% alignment with human judgments.

02 The HWPQ scheme reaches 85.9% accuracy on abstract semantics.

03 Role-based scoring alters model rankings.
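
To make the third finding concrete, here is a minimal sketch of how role-specific weights over evaluation angles can reorder models. It does not reproduce the paper's scoring formula, which this page does not give. The model names, role names, angle names, weights, and scores below are all hypothetical, and the weighted average is an assumed aggregation rule.

```python
from typing import Dict, List

# Hypothetical per-angle scores in [0, 1] for two made-up models.
ANGLE_SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"text_rendering": 0.9, "style_fusion": 0.5},
    "model_b": {"text_rendering": 0.6, "style_fusion": 0.8},
}

# Hypothetical role weights: how much each role cares about each angle.
ROLE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "designer": {"text_rendering": 0.8, "style_fusion": 0.2},
    "artist": {"text_rendering": 0.2, "style_fusion": 0.8},
}


def role_score(angle_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of angle scores (assumed aggregation rule)."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[angle] * angle_scores[angle] for angle in weights)
    return weighted_sum / total_weight


def rank_models(role: str) -> List[str]:
    """Return model names sorted from best to worst for the given role."""
    weights = ROLE_WEIGHTS[role]
    return sorted(
        ANGLE_SCORES,
        key=lambda model: role_score(ANGLE_SCORES[model], weights),
        reverse=True,
    )


if __name__ == "__main__":
    for role in ROLE_WEIGHTS:
        print(role, rank_models(role))
```

With these made-up numbers, the role that weights text rendering heavily ranks model_a first, while the role that weights style fusion heavily ranks model_b first, so a single global leaderboard would hide the difference.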

Abstract

We present VISTAR, a user-centric, multi-dimensional benchmark for text-to-image (T2I) evaluation that addresses the limitations of existing metrics. VISTAR introduces a two-tier hybrid paradigm: it employs deterministic, scriptable metrics for physically quantifiable attributes (e.g., text rendering, lighting) and a novel Hierarchical Weighted P/N Questioning (HWPQ) scheme that uses constrained vision-language models to assess abstract semantics (e.g., style fusion, cultural fidelity). Grounded in a Delphi study with 120 experts, we defined seven user roles and nine evaluation angles to construct the benchmark, which comprises 2,845 prompts validated by over 15,000 human pairwise comparisons. Our metrics achieve high human alignment (>75%), with the HWPQ scheme reaching 85.9% accuracy on abstract semantics, significantly outperforming VQA baselines. Comprehensive evaluation of…
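
The human-alignment figures quoted in the abstract are not defined on this page. One common way to measure agreement between a metric and human pairwise comparisons is the fraction of pairs in which the metric prefers the same image as the human rater. The sketch below assumes that definition; the function name, data layout, and tie handling are illustrative choices, not the authors' protocol.

```python
from typing import Dict, Iterable, Tuple

# Each comparison is (image_a, image_b, human_preferred_image).
Comparison = Tuple[str, str, str]


def pairwise_alignment(
    comparisons: Iterable[Comparison],
    metric_scores: Dict[str, float],
) -> float:
    """Fraction of pairs where the higher-scored image matches the human choice.

    Pairs on which the metric assigns equal scores are skipped (an assumption).
    Returns 0.0 if no pair can be counted.
    """
    agreed = 0
    counted = 0
    for image_a, image_b, human_choice in comparisons:
        score_a = metric_scores[image_a]
        score_b = metric_scores[image_b]
        if score_a == score_b:
            continue  # metric expresses no preference on this pair
        counted += 1
        metric_choice = image_a if score_a > score_b else image_b
        if metric_choice == human_choice:
            agreed += 1
    return agreed / counted if counted else 0.0


if __name__ == "__main__":
    # Made-up scores and human votes, for illustration only.
    scores = {"x": 0.9, "y": 0.4, "z": 0.7}
    votes = [("x", "y", "x"), ("y", "z", "z"), ("x", "z", "z"), ("x", "y", "y")]
    print(pairwise_alignment(votes, scores))
```

Under this definition, a value above 0.75 would mean the metric sides with the human rater on more than three of every four counted pairs.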

Taxonomy

Topics: Mathematics, Computing, and Information Processing · Handwritten Text Recognition Techniques