WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li

TL;DR
WebGen-Bench is a comprehensive benchmark for evaluating large language models' ability to generate complex, multi-file websites from scratch, using diverse instructions and rigorous testing procedures.
Contribution
The paper introduces WebGen-Bench, a new benchmark with a large instruction set and testing framework to evaluate LLMs in website generation tasks.
Findings
The best-performing model reaches only 27.8% accuracy on the benchmark's test cases, indicating how difficult the task is.
Training on WebGen-Instruct improves accuracy to 38.2%.
The benchmark reveals significant challenges in LLM-based website generation.
Abstract
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent's ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To…
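The test-case structure described in the abstract (an operation plus the expected result) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the class name `WebsiteTestCase`, its field names, the sample test case, and the `accuracy` helper are all hypothetical, and scoring is assumed here to be the plain fraction of passed test cases, which may differ from the paper's actual metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WebsiteTestCase:
    """Hypothetical container for one benchmark test case."""
    operation: str        # what to do on the generated website
    expected_result: str  # what should be observed afterwards


def accuracy(outcomes: list[bool]) -> float:
    """Return the percentage of passed test cases (assumed scoring rule)."""
    if not outcomes:
        raise ValueError("no test outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Illustrative test case; the wording is invented for this sketch.
case = WebsiteTestCase(
    operation="Submit the contact form with a valid email address",
    expected_result="A confirmation message is displayed",
)

# Illustrative pass/fail outcomes for four test cases.
outcomes = [True, False, False, True]
score = accuracy(outcomes)
```

With two of the four illustrative outcomes passing, `score` is 50.0. The 27.8% and 38.2% figures above are the same kind of percentage, computed over the benchmark's 647 test cases.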
Code & Models
- 🤗 luzimu/WebGen-LM-32B (model) · 7 downloads · 2 likes
- 🤗 luzimu/WebGen-LM-14B (model) · 2 downloads · 2 likes
- 🤗 luzimu/WebGen-LM-7B (model) · 3 downloads · 3 likes
- 🤗 luzimu/WebGenAgent-LM-8B-SFT (model) · 6 downloads
- 🤗 luzimu/WebGenAgent-LM-8B-Step-GRPO (model) · 3 downloads
- 🤗 luzimu/WebGenAgent-LM-7B-SFT (model) · 1 download
- 🤗 luzimu/WebGenAgent-LM-7B-Step-GRPO (model) · 1 download
Taxonomy
Topics: Web Data Mining and Analysis · Mathematics, Computing, and Information Processing
Methods: Sparse Evolutionary Training · ALIGN
