WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality
Chunyang Li, Yilun Zheng, Xinting Huang, Tianqing Fang, Jiahao Xu, Lihui Chen, Yangqiu Song, Han Hu

TL;DR
WebDevJudge is a benchmark for evaluating the reliability of LLMs as judges in web development. It reveals significant gaps compared to human experts and highlights key limitations of current models.
Contribution
This work provides the first systematic benchmark for assessing LLM-based evaluation in web development, including both static and interactive scenarios.
Findings
LLMs significantly lag behind human experts in web development evaluation.
Fundamental model limitations include failure to recognize functional equivalence and verify task feasibility.
The benchmark reveals critical areas for improving LLM evaluation capabilities.
Abstract
The paradigm of LLM-as-a-judge is emerging as a scalable and efficient alternative to human evaluation, demonstrating strong performance on well-defined tasks. However, its reliability in open-ended tasks with dynamic environments and complex interactions remains unexplored. To bridge the gap, we introduce WebDevJudge, a systematic benchmark for assessing LLM-as-a-judge performance in web development, with support for both non-interactive evaluation based on static observations and continuous interactive evaluation with a dynamic web environment. WebDevJudge comprises human preference labels over paired web implementations, annotated with structured and query-grounded rubrics to ensure high-quality ground truth. Using this benchmark, we comprehensively evaluate various evaluators, including LLMs, MLLMs, and agentic workflows. We systematically investigate the impact of different…
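As a rough illustration of how a judge's agreement with human preference labels over paired implementations might be scored, here is a minimal sketch. The label vocabulary (`"A"`, `"B"`, `"tie"`), the function name `agreement_rate`, and the toy data are assumptions made for illustration; they are not the benchmark's actual data format or scoring code.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of pairs where the judge's preference matches the human label.

    Labels are hypothetical: "A" or "B" for the preferred implementation,
    "tie" when neither is preferred.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one labeled pair")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Toy data (invented for this example): five paired comparisons.
human = ["A", "B", "A", "tie", "B"]
judge = ["A", "A", "A", "B", "B"]
rate = agreement_rate(human, judge)
print(f"agreement with human labels: {rate:.2f}")
```

In this toy case the judge matches the human label on three of five pairs. Treating a tie as its own label is one possible design choice; a scheme that excludes ties or gives partial credit would yield different numbers.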
Peer Reviews
Decision·ICLR 2026 Oral
Given the widespread adoption of LLM-as-a-judge, the need for better "judgements" of LLM-as-a-judge has grown as well. This paper adeptly addresses that need. The text is for the most part clear and easy to follow. The benchmark is also original: to my knowledge, there is no benchmark specifically for web development judgements.
- The main issue I see is that the results for all models seem relatively similar (~50s to 60s). There's not a lot of variation in performance, and there are no statistical tests to indicate that these values are actually meaningful. I have a strong suspicion that the reason these numbers are so similar is in fact the rubric. As pointed out in the paper, it increases inter-annotator agreement, but my guess is that it likely increases agreement *overall* as well and doesn't accur
- **A challenging benchmark that evaluators can be tested on**: the highest-performing evaluator achieves only 66% agreement with humans, indicating a large gap for improvement.
- **Clear insight into challenges in current evaluators**: the analysis that follows the main results from the benchmark highlights several key factors that keep models from being more effective evaluators. Other researchers can easily identify and work on improving these common failure modes.
- **Extremely well written
- **Low number of human annotators**: Only two annotators were used; their agreement is high, but I do wonder about some of the examples where the models fail to agree with humans. If you gave those examples to more annotators, maybe we would find that they are actually somewhat ambiguous.
- **Lack of concrete examples**: Some of these failure modes are quite high-level, like "operational reliability." I didn't see any example outputs from the models, but placing a few in the appendix may
1. A good analysis of the newly introduced benchmark, in a space where benchmarks (and especially meta-benchmarks) are much needed.
2. Draws conclusions that seem sound and are useful both for designing solutions and for designing other benchmarks.
1. Details about the agentic setup are lacking. There are some details in the appendix, but no analysis of where the agentic setup fails. We know agentic setups that interact with GUIs are still not great (and we know that they are often better when, e.g., selecting by element rather than by coordinates), but more detail on how useful the agentic side is for the final conclusion would be good for the paper.
2. In the appendix I see examples, but none have images. If this is a known limitation, can you
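One way to probe the concern raised in the reviews above, that closely clustered scores are reported without statistical tests, is a paired bootstrap over benchmark items. The sketch below estimates a 95% percentile interval for the difference in human agreement between two judges. The function name `paired_bootstrap_diff`, the label format, and the toy data are all assumptions for illustration; nothing here reproduces an analysis from the paper.

```python
import random


def paired_bootstrap_diff(human, judge_a, judge_b, n_resamples=2000, seed=0):
    """Return (observed_diff, ci_low, ci_high) for agreement(A) - agreement(B).

    Items are resampled with replacement, keeping both judges' outcomes on
    the same item together (a paired design).
    """
    n = len(human)
    if n == 0 or not (n == len(judge_a) == len(judge_b)):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    # 1 if the judge matches the human label on that item, else 0.
    correct_a = [int(h == a) for h, a in zip(human, judge_a)]
    correct_b = [int(h == b) for h, b in zip(human, judge_b)]
    observed = (sum(correct_a) - sum(correct_b)) / n
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    ci_low = diffs[int(0.025 * n_resamples)]
    ci_high = diffs[int(0.975 * n_resamples) - 1]
    return observed, ci_low, ci_high


# Toy data (invented): hypothetical labels for eight paired comparisons.
human = ["A", "B", "A", "A", "B", "B", "A", "B"]
judge_a = ["A", "B", "A", "B", "B", "A", "A", "B"]
judge_b = ["A", "A", "A", "B", "B", "A", "B", "B"]
obs, lo, hi = paired_bootstrap_diff(human, judge_a, judge_b)
print(f"difference in agreement: {obs:.3f} (95% interval: {lo:.3f} to {hi:.3f})")
```

If the interval comfortably includes zero, the two judges' agreement rates are hard to tell apart on that sample. With only a handful of items, as in this toy case, the interval will be wide regardless.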
Taxonomy
Topics: Software Engineering Research · Web Data Mining and Analysis · Software Testing and Debugging Techniques
