SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies

Siddhant Saxena; Nilesh Trivedi; Vinayaka Jyothi

arXiv:2605.04637·cs.MA·May 7, 2026

TL;DR

This paper introduces SWE-WebDev Bench, a comprehensive evaluation framework for AI-powered web development platforms, revealing common shortcomings and providing a community resource for benchmarking and improvement.

Contribution

It presents a novel 68-metric evaluation framework for assessing AI coding agents as virtual software agencies, along with empirical findings on current platform limitations.

Findings

01 Platforms often oversimplify business requirements.

02 Widespread frontend-backend decoupling issues.

03 No platform exceeds 60% in engineering quality.

Abstract

The emergence of "vibe coding" platforms, where users describe applications in natural language and AI agents autonomously generate full-stack software, has created a need for rigorous evaluation beyond code-level benchmarks. To assess these platforms as virtual software development agencies (understanding business requirements, making architectural decisions, writing production code, handling iterative modifications, and maintaining business readiness), we introduce SWE-WebDev Bench, a 68-metric evaluation framework spanning 25 primary and 43 diagnostic metrics across seven groups. The framework is organized along three dimensions: Interaction Mode (App Creation Request (ACR) vs. App Modification Request (AMR)), Agency Angle (Product Manager (PM), Engineering, Ops), and Complexity Tier (T4 multi-role SaaS, T5 AI-native). Our evaluation (six platforms, three domains, 18 evaluation cells) reveals four…
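The counts in the abstract can be checked with a small sketch. It assumes that an "evaluation cell" is one (platform, domain) pair, which is consistent with six platforms and three domains giving 18 cells but is not stated explicitly in the abstract. The platform and domain names, the `EvaluationCell` class, and the `build_cells` helper are hypothetical placeholders, not identifiers from the paper or its repository.

```python
from dataclasses import dataclass
from itertools import product

# Metric counts as stated in the abstract.
PRIMARY_METRICS = 25
DIAGNOSTIC_METRICS = 43

# The three organizing dimensions named in the abstract.
INTERACTION_MODES = ("ACR", "AMR")            # creation vs. modification request
AGENCY_ANGLES = ("PM", "Engineering", "Ops")
COMPLEXITY_TIERS = ("T4", "T5")               # multi-role SaaS, AI-native

# Placeholder names: the abstract does not list the platforms or domains.
PLATFORMS = tuple(f"platform_{i}" for i in range(1, 7))
DOMAINS = tuple(f"domain_{i}" for i in range(1, 4))


@dataclass(frozen=True)
class EvaluationCell:
    """Hypothetical record for one (platform, domain) pairing."""
    platform: str
    domain: str


def build_cells(platforms, domains):
    """Enumerate every platform/domain pair (assumed meaning of 'cell')."""
    return [EvaluationCell(p, d) for p, d in product(platforms, domains)]


cells = build_cells(PLATFORMS, DOMAINS)
total_metrics = PRIMARY_METRICS + DIAGNOSTIC_METRICS

print(f"{len(cells)} evaluation cells, {total_metrics} metrics in total")
```

Under this reading, the grid size follows directly from the Cartesian product of platforms and domains, and the 68-metric total is the sum of the primary and diagnostic counts.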

Code & Models

Repository: snowmountainAi/webdevbench (GitHub)
