UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

Sam Jung; Agustin Garcinuno; Spencer Mateega

arXiv:2508.20410·cs.CL·September 5, 2025

TL;DR

UI-Bench is a benchmark that evaluates the visual quality of sites generated by AI text-to-app tools through expert pairwise comparisons, aiming to establish a reproducible standard for evaluating AI-driven web design.

Contribution

UI-Bench introduces the first large-scale, reproducible benchmark that ranks AI text-to-app tools, together with an open-source evaluation framework and a public leaderboard.

Findings

01

UI-Bench evaluates 10 tools on 30 prompts, producing 300 generated sites and 4,000+ expert judgments.

02

The benchmark ranks tools with a TrueSkill-derived model that yields calibrated confidence intervals.

03

The released prompt set, evaluation framework, and leaderboard provide a reproducible standard for future AI web design research.

Abstract

AI text-to-app tools promise high-quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
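
To make the ranking idea concrete, the sketch below shows how pairwise judgments can be turned into ratings with uncertainty. It uses the standard two-player, no-draw TrueSkill update, not the paper's TrueSkill-derived model, whose details are not given in the abstract. The prior (mu = 25, sigma = 25/3), the noise parameter BETA, the 1.96-sigma interval, and the tool names are all illustrative assumptions.

```python
import math
from dataclasses import dataclass

# Conventional TrueSkill defaults; assumed here, not taken from the paper.
BETA = 25.0 / 6.0


@dataclass
class Rating:
    mu: float = 25.0          # estimated skill
    sigma: float = 25.0 / 3.0  # uncertainty about that estimate


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update(winner: Rating, loser: Rating, beta: float = BETA):
    """Two-player TrueSkill update for one judgment, with no draws."""
    c2 = 2.0 * beta**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / max(_cdf(t), 1e-12)  # floor avoids division by zero
    w = v * (v + t)
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1.0 - winner.sigma**2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1.0 - loser.sigma**2 / c2 * w),
    )
    return new_winner, new_loser


def rank(judgments):
    """Fold (winner, loser) name pairs into a dict of ratings."""
    ratings = {}
    for win, lose in judgments:
        a = ratings.get(win, Rating())
        b = ratings.get(lose, Rating())
        ratings[win], ratings[lose] = update(a, b)
    return ratings


def interval(r: Rating, z: float = 1.96):
    """Approximate interval around mu; the z value is an assumption."""
    return (r.mu - z * r.sigma, r.mu + z * r.sigma)


# Hypothetical judgments: each pair is (preferred site's tool, other tool).
ratings = rank([("tool_a", "tool_b"), ("tool_a", "tool_c"), ("tool_b", "tool_c")])
for name, r in sorted(ratings.items(), key=lambda kv: -kv[1].mu):
    low, high = interval(r)
    print(f"{name}: mu={r.mu:.2f}, interval=({low:.2f}, {high:.2f})")
```

Each judgment moves the preferred tool's mean up and the other's down, and shrinks both uncertainties, so intervals narrow as more expert comparisons accumulate. This sequential update depends on the order of judgments; a production ranking would likely need a batch or order-independent procedure.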
