UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools
Sam Jung, Agustin Garcinuno, Spencer Mateega

TL;DR
UI-Bench is a benchmark that evaluates the visual quality of AI text-to-app tools through expert pairwise comparisons, establishing a reproducible standard for AI-driven web design evaluation.
Contribution
It introduces the first large-scale, reproducible benchmark with a ranking system for AI text-to-app tools, including an open-source framework and public leaderboard.
Findings
UI-Bench evaluates 10 tools on 30 prompts, producing 300 generated sites that are compared in 4,000+ expert pairwise judgments.
Tools are ranked with a TrueSkill-derived model that yields calibrated confidence intervals.
It provides a reproducible standard and resources for future AI web design research.
Abstract
AI text-to-app tools promise high-quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across competing AI text-to-app tools through expert pairwise comparison. Spanning 10 tools, 30 prompts, 300 generated sites, and 4,000+ expert judgments, UI-Bench ranks systems with a TrueSkill-derived model that yields calibrated confidence intervals. UI-Bench establishes a reproducible standard for advancing AI-driven web design. We release (i) the complete prompt set, (ii) an open-source evaluation framework, and (iii) a public leaderboard. The generated sites rated by participants will be released soon. View the UI-Bench leaderboard at https://uibench.ai/leaderboard.
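To make the ranking idea concrete, here is a minimal Python sketch of how pairwise expert judgments could be turned into ratings with intervals. It is an illustration under stated assumptions, not the UI-Bench implementation: the abstract says only that the model is "TrueSkill-derived", so the sketch uses the standard two-player TrueSkill update with no draws and no dynamics term, the conventional defaults (mu = 25, sigma = 25/3, beta = 25/6), and a Gaussian interval of mu ± 1.96·sigma. The names `Rating`, `update`, `rate`, and `leaderboard` are hypothetical.

```python
import math
from dataclasses import dataclass

# Conventional TrueSkill defaults (assumed here, not taken from the paper).
MU0 = 25.0
SIGMA0 = MU0 / 3
BETA = MU0 / 6


@dataclass
class Rating:
    mu: float = MU0
    sigma: float = SIGMA0


def _pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))


def update(winner: Rating, loser: Rating, beta: float = BETA):
    """Two-player TrueSkill update for one judgment (no draws, no dynamics)."""
    c2 = 2 * beta**2 + winner.sigma**2 + loser.sigma**2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)   # mean correction factor
    w = v * (v + t)         # variance correction factor, in (0, 1)
    new_winner = Rating(
        winner.mu + winner.sigma**2 / c * v,
        winner.sigma * math.sqrt(1 - winner.sigma**2 / c2 * w),
    )
    new_loser = Rating(
        loser.mu - loser.sigma**2 / c * v,
        loser.sigma * math.sqrt(1 - loser.sigma**2 / c2 * w),
    )
    return new_winner, new_loser


def rate(judgments):
    """Fold a sequence of (winner_name, loser_name) judgments into ratings."""
    ratings = {}
    for win, lose in judgments:
        rw = ratings.get(win, Rating())
        rl = ratings.get(lose, Rating())
        ratings[win], ratings[lose] = update(rw, rl)
    return ratings


def leaderboard(ratings, z: float = 1.96):
    """Return (name, mu, lower, upper) rows sorted by mu, highest first."""
    rows = [(n, r.mu, r.mu - z * r.sigma, r.mu + z * r.sigma)
            for n, r in ratings.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments between three placeholder tools.
    demo = [("A", "B"), ("A", "C"), ("B", "C")]
    for name, mu, lo, hi in leaderboard(rate(demo)):
        print(f"{name}: mu={mu:.2f}  interval=[{lo:.2f}, {hi:.2f}]")
```

Each judgment moves the winner's mean up and the loser's mean down, and shrinks both standard deviations, so the intervals narrow as more comparisons arrive. Sequential updates of this kind depend on the order of the judgments. A production system might shuffle or iterate over them, and whether UI-Bench does so is not stated here.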
