From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems

Marcos Ortiz; Justin Hill; Collin Overbay; Ingrida Semenec; Frederic Sauve-Hoover; Jim Schwoebel; Joel Shor

arXiv:2512.18080·cs.HC·February 16, 2026

Open Access

TL;DR

This paper introduces a human-centered benchmark for evaluating prompt-to-app AI systems, comparing Replit, Bolt, and Firebase Studio through large-scale human studies to assess usability, visual appeal, trust, and completeness.

Contribution

The paper presents a new human-centered benchmark framework and a comparative study of three prominent prompt-to-app platforms, using diverse prompts and human evaluations.

Findings

1. Firebase Studio outperforms others in all evaluated dimensions.
2. Bolt is competitive in visual appeal but lags in usability and trust.
3. Replit underperforms across most metrics.

Abstract

Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons,…
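
The study design described in the abstract can be sketched in a few lines. The sketch below is an illustration only, not the authors' code: the prompt identifiers are hypothetical placeholders, and the simple win-rate tally is one common way to summarize pairwise judgments, not necessarily the analysis the paper uses. It shows how 96 prompts across three platforms yield 288 artifacts, and how a list of pairwise votes could be reduced to per-platform win rates.

```python
from collections import Counter
from itertools import combinations

PLATFORMS = ["Replit", "Bolt", "Firebase Studio"]  # platforms named in the abstract
NUM_PROMPTS = 96  # prompt count stated in the abstract


def enumerate_artifacts(num_prompts=NUM_PROMPTS, platforms=PLATFORMS):
    """One artifact per (prompt, platform) pair; prompt ids are hypothetical."""
    return [(f"prompt_{i:02d}", p) for i in range(num_prompts) for p in platforms]


def platform_pairs(platforms=PLATFORMS):
    """All unordered platform pairs that a rater could compare side by side."""
    return list(combinations(platforms, 2))


def win_rates(votes):
    """Fraction of its comparisons that each system won.

    `votes` is a list of (winner, loser) tuples. Ties are assumed to have
    been filtered out beforehand; this tally is an assumption, not the
    paper's stated method.
    """
    wins, appearances = Counter(), Counter()
    for winner, loser in votes:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    return {name: wins[name] / appearances[name] for name in appearances}


artifacts = enumerate_artifacts()  # 96 prompts x 3 platforms
pairs = platform_pairs()           # 3 possible head-to-head pairings
```

With made-up votes between placeholder systems, `win_rates([("A", "B"), ("A", "C"), ("B", "C")])` gives A a rate of 1.0, B 0.5, and C 0.0. Real rater data would also need the quality filtering the abstract mentions, which this sketch does not model.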

Taxonomy

Topics: Software Engineering Techniques and Practices · Software Engineering Research · Mobile Crowdsensing and Crowdsourcing