TL;DR
This paper introduces SWE-WebDev Bench, a comprehensive evaluation framework for AI-powered web development platforms, revealing common shortcomings and providing a community resource for benchmarking and improvement.
Contribution
The paper presents a novel 68-metric evaluation framework for assessing AI coding agents as virtual software agencies, along with empirical findings on the limitations of current platforms.
Findings
Platforms often oversimplify business requirements.
Frontend-backend decoupling issues are widespread.
No platform exceeds 60% in engineering quality.
Abstract
The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. To assess these platforms as virtual software development agencies (understanding business requirements, making architectural decisions, writing production code, handling iterative modifications, and maintaining business readiness), we introduce SWE-WebDev Bench, a 68-metric evaluation framework spanning 25 primary and 43 diagnostic metrics across seven groups, organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native). Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four…
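The structure described in the abstract can be pictured as a small data model. The following is a minimal sketch only, not the benchmark's actual code: the class names, the `build_cells` helper, and the placeholder platform and domain labels are all hypothetical, and it assumes the 18 evaluation cells are the cross product of the six platforms and three domains, which the abstract implies but does not state outright.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


# The three dimensions named in the abstract.
class InteractionMode(Enum):
    ACR = "App Creation Request"
    AMR = "App Modification Request"


class AgencyAngle(Enum):
    PM = "Product Manager"
    ENGINEERING = "Engineering"
    OPS = "Ops"


class ComplexityTier(Enum):
    T4 = "multi-role SaaS"
    T5 = "AI-native"


# Metric counts stated in the abstract.
N_PRIMARY_METRICS = 25
N_DIAGNOSTIC_METRICS = 43


@dataclass(frozen=True)
class EvaluationCell:
    """One (platform, domain) pairing; an assumed reading of 'evaluation cell'."""
    platform: str
    domain: str


# Placeholder labels: the abstract does not name the platforms or domains.
PLATFORMS = [f"platform_{i}" for i in range(1, 7)]
DOMAINS = [f"domain_{i}" for i in range(1, 4)]


def build_cells(platforms, domains):
    """Enumerate every platform/domain combination (hypothetical helper)."""
    return [EvaluationCell(p, d) for p, d in product(platforms, domains)]


cells = build_cells(PLATFORMS, DOMAINS)
total_metrics = N_PRIMARY_METRICS + N_DIAGNOSTIC_METRICS
```

Under these assumptions, `len(cells)` matches the 18 cells reported and `total_metrics` matches the 68 metrics, so the figures in the abstract are internally consistent.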
