Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
Kai Xu, YiWei Mao, XinYi Guan, ZiLong Feng

TL;DR
Web-Bench is a new, comprehensive benchmark for evaluating large language models on web development tasks, emphasizing real-world workflows and foundational web standards and frameworks.
Contribution
The paper introduces Web-Bench, a challenging, multi-project benchmark based on real-world web development tasks, addressing saturation issues in existing LLM code benchmarks.
Findings
The state-of-the-art (SOTA) LLM achieves only 25.1% Pass@1 on Web-Bench.
Web-Bench projects are designed by experienced engineers, which increases their difficulty.
The benchmark highlights the need to optimize LLMs for web standards and frameworks.
Abstract
The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLMs. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development…
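The sequential-dependency structure described in the abstract can be illustrated with a small sketch. This is not the authors' evaluation harness: the function name `sequential_pass_rate`, the boolean result format, and the rule that a project stops counting at its first failed task are all assumptions made for illustration, and the paper's exact Pass@1 definition may differ.

```python
from typing import List


def sequential_pass_rate(projects: List[List[bool]]) -> float:
    """Fraction of all tasks passed, assuming each project halts at its first failure.

    `projects` is a hypothetical result format: one list per project, holding
    first-attempt outcomes for its tasks in dependency order.
    """
    total = sum(len(tasks) for tasks in projects)
    passed = 0
    for tasks in projects:
        for ok in tasks:
            if not ok:
                # Assumption: later tasks build on this one, so they are
                # treated as failed once an earlier task fails.
                break
            passed += 1
    return passed / total if total else 0.0


if __name__ == "__main__":
    # Two toy projects of four tasks each (Web-Bench itself uses 50 x 20).
    results = [
        [True, True, False, True],  # stops after two passes
        [True, True, True, True],   # all four pass
    ]
    print(sequential_pass_rate(results))
```

Under this assumed rule, an early failure costs every later task in the same project, which is one plausible reason a benchmark with sequential dependencies is harder to saturate than one built from independent problems.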
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Model-Driven Software Engineering Techniques
