WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Xinping Lei; Xinyu Che; Junqi Xiong; Chenchen Zhang; Yukai Huang; Chenyu Zhou; Haoyang Huang; Minghao Liu; Letian Zhu; Hongyi Ye; Jinhua Hao; Ken Deng; Zizheng Zhan; Han Li; Dailin Li; Yifan Yao; Ming Sun; Zhaoxiang Zhang; Jiaheng Liu

arXiv:2604.18224 · cs.SE · April 21, 2026


TL;DR

WebCompass is a comprehensive multimodal benchmark for evaluating web coding capabilities of language models across generation, editing, and repair tasks, reflecting real-world workflows.

Contribution

WebCompass introduces a multimodal, multi-task benchmark with human-in-the-loop evaluation protocols. It covers diverse web engineering scenarios and gives insight into model strengths and weaknesses.

Findings

01. Closed-source models outperform open-source models in web coding tasks.
02. Repair tasks better preserve interactivity but are more challenging to execute.
03. Aesthetics remains a persistent bottleneck, especially for open-source models.

Abstract

Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect…
