Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Kai Xu; YiWei Mao; XinYi Guan; ZiLong Feng

arXiv:2505.07473 · cs.AI · May 13, 2025

TL;DR

Web-Bench is a new, comprehensive benchmark for evaluating large language models in web development tasks, emphasizing real-world workflows and foundational web standards and frameworks.

Contribution

The paper introduces Web-Bench, a challenging, multi-project benchmark based on real-world web development tasks, addressing saturation issues in existing LLM code benchmarks.

Findings

01. The SOTA LLM achieves only 25.1% Pass@1 on Web-Bench.

02. Web-Bench projects are designed by experienced engineers, which raises their difficulty.

03. The benchmark highlights the need to optimize LLMs for web standards and frameworks.

Abstract

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks primarily focused on code generation accuracy, but these benchmarks have gradually become saturated. Benchmark saturation weakens their guiding role for LLMs. For example, HumanEval Pass@1 has reached 99.4% and MBPP 94.2%. Among various attempts to address benchmark saturation, approaches based on software engineering have stood out, but the saturation of existing software engineering benchmarks is rapidly increasing. To address this, we propose a new benchmark, Web-Bench, which contains 50 projects, each consisting of 20 tasks with sequential dependencies. The tasks implement project features in sequence, simulating real-world human development…
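The sequential structure described in the abstract can be illustrated with a small sketch. The source does not say how a failed task affects later tasks or how scores are aggregated, so the stopping rule (abandon a project at its first failed task) and the pass-rate formula below are assumptions, and `run_project` and `pass_rate` are hypothetical names, not part of the Web-Bench tooling.

```python
from typing import Callable, Sequence

# A task is modeled as a zero-argument callable that returns True when the
# single attempt at that task passes its tests (hypothetical representation).
Task = Callable[[], bool]


def run_project(tasks: Sequence[Task]) -> int:
    """Run tasks in order and return how many passed.

    Assumption: because each task builds on the previous ones, the project
    is abandoned at the first failing task.
    """
    passed = 0
    for attempt in tasks:
        if not attempt():
            break
        passed += 1
    return passed


def pass_rate(projects: Sequence[Sequence[Task]]) -> float:
    """Fraction of all tasks passed on the first attempt, across projects.

    Assumption: the score is passed tasks divided by total tasks.
    """
    total = sum(len(tasks) for tasks in projects)
    if total == 0:
        raise ValueError("no tasks to evaluate")
    return sum(run_project(tasks) for tasks in projects) / total


# Toy data: two projects with four tasks each (the benchmark itself is
# described as 50 projects with 20 tasks each).
project_a = [lambda: True, lambda: True, lambda: False, lambda: True]
project_b = [lambda: True, lambda: True, lambda: True, lambda: True]
score = pass_rate([project_a, project_b])  # (2 + 4) / 8
```

Under this assumed rule, the fourth task of `project_a` never counts even though it would pass in isolation, which is one way sequential dependencies can make a benchmark harder than a set of independent problems.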

Code & Models

Repositories

bytedance/Web-Bench (official)

Datasets

bytedance-research/Web-Bench (dataset, 374 downloads)

Taxonomy

Topics: Software Engineering Research · Topic Modeling · Model-Driven Software Engineering Techniques