Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Hung Tran; Langston Nashold; Rayan Krishnan; Antoine Bigeard; Alex Gu

arXiv:2603.04601 · cs.SE · May 15, 2026

TL;DR

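Vibe Code Bench is a benchmark for end-to-end web application development: 100 application specifications whose deployed implementations are evaluated by an autonomous browser agent. The best of 16 frontier models reaches 61.8% accuracy on the test split, and self-testing during generation is a strong predictor of performance.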

Contribution

The paper contributes a dataset of 100 web application specifications, a browser-based evaluation pipeline, and an analysis of 16 frontier models on end-to-end web app creation, emphasizing the importance of self-testing and evaluator alignment.

Findings

01

The best of 16 frontier models achieves 61.8% accuracy on the test split.

02

Self-testing during generation is a strong predictor of performance (Pearson r=0.72).

03

Evaluator selection materially affects outcomes: pairwise step-level agreement ranges from 31.8% to 93.6%.

Abstract

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench, a benchmark of 100 web application specifications (50 private validation, 50 held-out test) with 964 browser-based workflows comprising 10,131 substeps, evaluated against deployed applications by an autonomous browser agent. Across 16 frontier models, the best achieves 61.8% accuracy on the test split, revealing that reliable end-to-end application development remains a frontier challenge. We identify self-testing during generation as a strong performance predictor (Pearson r=0.72), and show through a completed human alignment study that evaluator selection materially affects outcomes (31.8-93.6% pairwise step-level agreement). Our contributions…
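The two statistics quoted in the abstract, a Pearson correlation and pairwise step-level agreement, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code: the function names, the input layout (one pass/fail verdict per step per evaluator), and the toy values in the demo are all assumptions made for illustration, and the summary above does not specify exactly how either statistic was computed.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def pairwise_agreement(verdicts):
    """Share of steps on which each pair of evaluators gives the same verdict.

    `verdicts` maps an evaluator name (hypothetical) to a list of pass/fail
    booleans, one per evaluated step, in the same step order for everyone.
    """
    result = {}
    for a, b in combinations(sorted(verdicts), 2):
        va, vb = verdicts[a], verdicts[b]
        if len(va) != len(vb) or not va:
            raise ValueError("evaluators must judge the same non-empty step list")
        matches = sum(1 for x, y in zip(va, vb) if x == y)
        result[(a, b)] = matches / len(va)
    return result


if __name__ == "__main__":
    # Toy, made-up numbers: a per-model self-testing measure vs. accuracy.
    self_test_rate = [0.1, 0.4, 0.5, 0.9]
    accuracy = [0.20, 0.35, 0.30, 0.60]
    print(pearson_r(self_test_rate, accuracy))

    # Toy, made-up verdicts from three hypothetical evaluators on four steps.
    print(pairwise_agreement({
        "agent": [True, True, False, False],
        "human_1": [True, False, False, False],
        "human_2": [False, False, True, True],
    }))
```

Reporting agreement per evaluator pair, rather than as a single pooled number, is what makes a range such as the one in the abstract possible: some pairs can agree on nearly every step while others agree on fewer than half.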
