ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation

Chenchen Zhang; Yuhang Li; Can Xu; Jiaheng Liu; Ao Liu; Changzhi Zhou; Ken Deng; Dengpeng Wu; Guanhua Huang; Kejiao Li; Qi Yi; Ruibin Xiong; Shihui Hu; Yue Zhang; Yuhao Jiang; Zenan Xu; Yuanxing Zhang; Wiggin Zhou; Chayse Zhou; Fengzong Lian

arXiv:2507.04952 · cs.CL · September 30, 2025

TL;DR

ArtifactsBench introduces a comprehensive, automated multimodal evaluation framework for visual code generation by LLMs, bridging the gap between code correctness and visual-interactive quality assessment.

Contribution

We propose ArtifactsBench, a novel benchmark and evaluation paradigm that assesses visual code artifacts using multimodal analysis and a fine-grained checklist, enabling scalable, human-aligned quality measurement.

Findings

1. Achieves 94.4% ranking consistency with WebDev Arena.
2. Over 90% pairwise agreement with human experts.
3. Generalist models often outperform domain-specific models.
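The two agreement figures above can be illustrated with a small sketch. It assumes that "ranking consistency" is a normalized Spearman footrule similarity between two model rankings, and that "pairwise agreement" is the fraction of item pairs that two scorers order the same way. The exact normalization and pairing protocol used by the authors are not stated on this page, and the function names `footrule_consistency` and `pairwise_agreement` are hypothetical.

```python
from itertools import combinations


def footrule_consistency(rank_a, rank_b):
    """Normalized Spearman footrule similarity in [0, 1].

    rank_a, rank_b: dicts mapping item -> rank position (1 = best).
    Assumption: both rankings cover the same items without ties.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        return 1.0
    distance = sum(abs(rank_a[k] - rank_b[k]) for k in rank_a)
    max_distance = (n * n) // 2  # footrule distance of a fully reversed ranking
    return 1.0 - distance / max_distance


def pairwise_agreement(scores_a, scores_b):
    """Fraction of item pairs that both score sets order the same way.

    Pairs tied in either score set are skipped (an assumption of this sketch).
    """
    agree = 0
    total = 0
    for x, y in combinations(scores_a, 2):
        diff_a = scores_a[x] - scores_a[y]
        diff_b = scores_b[x] - scores_b[y]
        if diff_a == 0 or diff_b == 0:
            continue
        total += 1
        if (diff_a > 0) == (diff_b > 0):
            agree += 1
    return agree / total if total else float("nan")
```

Identical rankings give a consistency of 1.0 and fully reversed rankings give 0.0, so a value such as 94.4% would indicate that the two leaderboards differ only by small position shifts under this definition.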

Abstract

The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and…
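The evaluation flow described in the abstract (rendered screenshots plus source code, judged against a per-task checklist) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the request layout, the `ChecklistItem` type, the 0 to 10 score scale, and the plain averaging are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ChecklistItem:
    criterion: str  # what the judge is asked to verify
    score: float    # judge's score for this criterion (hypothetical 0-10 scale)


def build_judge_request(source_code, screenshot_paths, criteria):
    """Bundle code, time-ordered screenshots, and checklist (hypothetical format)."""
    if not screenshot_paths:
        raise ValueError("at least one screenshot is required")
    return {
        "code": source_code,
        "screenshots": list(screenshot_paths),  # kept in capture order
        "checklist": list(criteria),
    }


def task_score(items):
    """Average per-criterion scores into one task-level score (assumed aggregation)."""
    if not items:
        raise ValueError("checklist must not be empty")
    return mean(item.score for item in items)
```

Keeping the checklist as explicit items rather than a single free-form prompt is what makes per-criterion scores comparable across models in a setup like this.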

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

- Having three temporal screenshots is an addition to existing evals that only look at one static screenshot.
- It's great to get an extra benchmark for visual artifact generation.

Weaknesses

- Why do you not have any examples of the actual benchmark examples? Not even in the Appendix? It makes it much harder to judge the actual quality of the benchmark.
- I'm not quite sure if shoving three screenshots of the interactions to the LLM judge is the best way to evaluate the functional correctness of the dynamic interaction? Do you have any sort of human evaluation that lets users try out the generated websites to perform some specified, realistic tasks? Would that correlate with your

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. It is a comprehensive benchmark with 1800+ tasks.
2. More than 30 models are benchmarked.

Weaknesses

1. Benchmarking visual code generation is not a novel problem; there are many works in this direction. We have similar benchmarks for the website and SVG before, while this benchmark claims to extend the scope to Game, Simulation, Data Science, etc, the evaluation idea is largely similar: show screenshots to MLLM and ask for judgment. I don't think you can judge the quality of a game by screenshots with limited interaction. In general, I don't see many useful insights from this very broad benchm

Reviewer 03 · Rating 8 · Confidence 3

Strengths

- The paper introduces a valuable resource with 1,825 executable tasks spanning 9 domains with difficulty tiers; supports fine-grained analysis beyond single static correctness.
- Proposes an interactive evaluation design where three-step screenshots and sandboxed execution capture dynamics while keeping runs reproducible.
- Evaluates an extensive suite of 30+ LLMs, spanning both open-source and proprietary models; evaluation results show high pairwise agreement (up to 90.95%) and 94.4% Footrule

Weaknesses

- Three screenshots may miss long-horizon workflows and nuanced physics/UX timing; authors acknowledge this. Including richer scripted interactions or short videos may strengthen the evaluation.
- Fixed 1024×768 and single-browser setting may underrepresent responsive/adaptive designs; consider multi-viewport evaluation.
- Checklists are LLM-drafted then human-refined; potential leakage of judge priors and over-optimization to rubric specifics, worth stress-tests with diverse/adversarial prompt s

Code & Models

Datasets

tencent/ArtifactsBenchmark
dataset· 154 dl

Taxonomy

Topics: Natural Language Processing Techniques
