WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie

TL;DR
WebCoderBench is a comprehensive and interpretable benchmark for evaluating web application generation by LLMs, using real user requirements and diverse metrics to guide model improvement.
Contribution
It introduces the first real-world, generalizable, and interpretable benchmark with 1,572 user requirements and 24 evaluation metrics for web app generation.
Findings
No single model dominates across all metrics
WebCoderBench enables targeted model optimization
Experiments reveal diverse strengths and weaknesses of LLMs
Abstract
Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we introduce WebCoderBench, the first real-world-collected, generalizable, and interpretable benchmark for web app generation. WebCoderBench comprises 1,572 real user requirements, covering diverse modalities and expression styles that reflect realistic user intentions. WebCoderBench provides 24 fine-grained evaluation metrics across 9 perspectives, combining rule-based and LLM-as-a-judge paradigm for fully automated, objective, and general…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsModel-Driven Software Engineering Techniques · Software Engineering Research · Software Engineering Techniques and Practices
