From Prompt to Product: A Human-Centered Benchmark of Agentic App Generation Systems
Marcos Ortiz, Justin Hill, Collin Overbay, Ingrida Semenec, Frederic Sauve-Hoover, Jim Schwoebel, Joel Shor

TL;DR
This paper introduces a human-centered benchmark for evaluating prompt-to-app AI systems, comparing Replit, Bolt, and Firebase Studio through large-scale human studies to assess usability, visual appeal, trust, and completeness.
Contribution
It presents a new human-centered benchmark framework and conducts a comprehensive comparative study of three prominent prompt-to-app platforms using diverse prompts and human evaluations.
Findings
Firebase Studio outperforms others in all evaluated dimensions.
Bolt is competitive in visual appeal but lags in usability and trust.
Replit underperforms across most metrics.
Abstract
Agentic AI systems capable of generating full-stack web applications from natural language prompts ("prompt-to-app") represent a significant shift in software development. However, evaluating these systems remains challenging, as visual polish, functional correctness, and user trust are often misaligned. As a result, it is unclear how existing prompt-to-app tools compare under realistic, human-centered evaluation criteria. In this paper, we introduce a human-centered benchmark for evaluating prompt-to-app systems and conduct a large-scale comparative study of three widely used platforms: Replit, Bolt, and Firebase Studio. Using a diverse set of 96 prompts spanning common web application tasks, we generate 288 unique application artifacts. We evaluate these systems through a large-scale human-rater study involving 205 participants and 1,071 quality-filtered pairwise comparisons,…
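The artifact count in the abstract follows from running each of the 96 prompts on each of the three platforms. The short sketch below checks that arithmetic. It also enumerates the platform pairings per prompt, on the assumption (not stated in the truncated abstract) that pairwise comparisons are made between platforms on the same prompt. The variable names are illustrative, not taken from the paper.

```python
from itertools import combinations

# Platforms and prompt count as stated in the abstract.
PLATFORMS = ["Replit", "Bolt", "Firebase Studio"]
NUM_PROMPTS = 96

# One generated application per (prompt, platform) combination.
artifacts = [(prompt_id, platform)
             for prompt_id in range(NUM_PROMPTS)
             for platform in PLATFORMS]

# Assumption: comparisons pit two platforms against each other on the same prompt.
pairings = [(prompt_id, a, b)
            for prompt_id in range(NUM_PROMPTS)
            for a, b in combinations(PLATFORMS, 2)]

print(len(artifacts))  # 96 * 3 = 288
print(len(pairings))   # 96 * 3 = 288 distinct same-prompt pairings
```

Under that assumption there are 288 distinct pairings, so the reported 1,071 quality-filtered comparisons would mean many pairings were judged by more than one rater.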
Taxonomy
Topics: Software Engineering Techniques and Practices · Software Engineering Research · Mobile Crowdsensing and Crowdsourcing
