Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

TL;DR
This paper introduces Design2Code, a benchmark for evaluating multimodal large language models' ability to convert webpage screenshots into accurate code, highlighting current models' limitations in visual element recall and layout accuracy.
Contribution
It presents the first real-world benchmark for multimodal code generation from visual designs, including curated datasets, evaluation metrics, and comprehensive model testing.
Findings
Models struggle with visual element recall.
Models often generate incorrect layouts.
Benchmark reveals significant room for improvement.
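To make "visual element recall" concrete, here is a minimal sketch of one way such a score could be computed over the text blocks of a reference page and a generated page. It is an illustration only, not the paper's metric: the function names, the greedy matching, the `difflib`-based similarity, and the 0.8 threshold are all assumptions introduced here.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Case-insensitive character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def element_recall(reference: list[str], generated: list[str],
                   threshold: float = 0.8) -> float:
    """Fraction of reference text blocks that have a sufficiently similar
    block in the generated page (hypothetical helper, assumed threshold).

    Each generated block can be matched to at most one reference block.
    """
    if not reference:
        return 1.0  # nothing to recall
    unused = list(generated)
    matched = 0
    for ref in reference:
        # Greedily pick the most similar generated block not yet used.
        best = max(unused, key=lambda g: text_similarity(ref, g), default=None)
        if best is not None and text_similarity(ref, best) >= threshold:
            matched += 1
            unused.remove(best)
    return matched / len(reference)
```

For example, a generated page that reproduces "Home" and "About Us" but omits "Contact" would recall two of the three reference blocks under this sketch. A real evaluation would also need to compare position, color, and layout, which this toy version ignores.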
Abstract
Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language models (MLLMs) directly convert visual designs into code implementations. In this work, we construct Design2Code - the first real-world benchmark for this task. Specifically, we manually curate 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations to validate the performance ranking. To rigorously benchmark MLLMs, we test various multimodal prompting methods on frontier…
Taxonomy
Topics: Manufacturing Process and Optimization · BIM and Construction Integration
Methods: Sparse Evolutionary Training
