WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Zhiyu Lin; Zhengda Zhou; Zhiyuan Zhao; Tianrui Wan; Yilun Ma; Junyu Gao; Xuelong Li

arXiv:2506.07818 · cs.CL · June 10, 2025

TL;DR

WebUIBench is a comprehensive benchmark designed to evaluate multimodal large language models across multiple web development sub-tasks, providing detailed insights into their strengths and weaknesses in WebUI-to-Code tasks.

Contribution

This work introduces WebUIBench, a novel multi-view evaluation framework with 21K QA pairs, to systematically assess MLLMs' capabilities in web development tasks.

Findings

1. Twenty-nine MLLMs were evaluated, revealing diverse skill profiles.

2. The evaluation identified common weaknesses in perception and understanding.

3. The benchmark results are intended to guide future model improvements.

Abstract

With the rapid advancement of Generative AI technology, Multimodal Large Language Models (MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires a confluence of multidimensional sub-capabilities to address the challenges of various development phases, constructing a multi-view evaluation framework is crucial for accurately guiding the enhancement of development efficiency. However, existing benchmarks usually fail to provide an assessment of sub-capabilities and focus solely on webpage generation outcomes. In this work, we draw inspiration from the principles of software engineering and further propose WebUIBench, a benchmark systematically designed to evaluate MLLMs in four key areas: WebUI Perception, HTML Programming, WebUI-HTML Understanding, and WebUI-to-Code. WebUIBench comprises 21K…

Code & Models

Repositories

mail-tele-ai/webuibench (Official)

Videos

WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Focus