TL;DR
Vision2Web is a benchmark for evaluating how well AI models build websites from visual prototypes, covering static, interactive, and full-stack tasks, with a workflow-based agent verification paradigm.
Contribution
It introduces a hierarchical benchmark built from real-world websites and a workflow-based agent verification paradigm, combining a GUI agent verifier with a VLM-based judge, for systematic evaluation of website development models.
Findings
State-of-the-art models show significant performance gaps.
Models struggle with full-stack website development.
The benchmark covers 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases.
Abstract
Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development, spanning static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises a total of 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm based on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple visual language models instantiated under different coding-agent frameworks, revealing substantial performance…
