WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

Zimu Lu; Yunqiao Yang; Houxing Ren; Haotian Hou; Han Xiao; Ke Wang; Weikang Shi; Aojun Zhou; Mingjie Zhan; Hongsheng Li

arXiv:2505.03733 · cs.CL · August 12, 2025

TL;DR

WebGen-Bench is a comprehensive benchmark for evaluating large language models' ability to generate complex, multi-file websites from scratch, using diverse instructions and rigorous testing procedures.

Contribution

The paper introduces WebGen-Bench, a new benchmark that pairs diverse website-generation instructions with 647 manually curated test cases to evaluate how well LLMs build websites from scratch.

Findings

01 The best model achieves only 27.8% accuracy on the test cases, indicating the task's difficulty.

02 Training on WebGen-Instruct improves accuracy to 38.2%.

03 Even at these accuracies, most test cases still fail, which shows how far LLM-based website generation remains from reliable.

Abstract

LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To…
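As a rough illustration of the test-case format described in the abstract (an operation to perform on the website plus the expected result), the sketch below models one test case and a simple accuracy score. The class name `WebTestCase`, its field names, the example text, and the plain pass/fail scoring are all assumptions made for illustration; they are not taken from the benchmark, whose actual schema and scoring rule may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebTestCase:
    """One functional check (hypothetical schema, not the benchmark's own)."""
    operation: str        # action to perform on the generated website
    expected_result: str  # what should be observable after the action


def accuracy(outcomes: list[bool]) -> float:
    """Fraction of test cases that passed, assuming simple pass/fail outcomes."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


# Hypothetical example; this text is not taken from the benchmark.
case = WebTestCase(
    operation="Submit the contact form with a valid name and email.",
    expected_result="A confirmation message appears on the page.",
)

print(case.operation)
print(f"{accuracy([True, False, False, True]):.1%}")
```

Keeping the operation and the expected result as separate fields mirrors the two-part description in the abstract and lets a tester, human or automated, judge each part independently.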

Code & Models

Repositories

mnluzimu/webgen-bench
PyTorch · Official

Videos

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch · SlidesLive

Taxonomy

Topics: Web Data Mining and Analysis · Mathematics, Computing, and Information Processing

Methods: Sparse Evolutionary Training · ALIGN