ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

TL;DR
ScreenCoder introduces a modular multi-agent system for transforming UI images into front-end code, significantly improving accuracy and robustness over monolithic models through specialized stages and scalable data generation.
Contribution
It proposes a novel multi-agent framework that decomposes UI-to-code tasks into interpretable stages, enhancing robustness and enabling scalable data creation for model fine-tuning.
Findings
Achieves state-of-the-art layout accuracy and code correctness.
Significantly improves robustness over end-to-end models.
Enables scalable high-quality data generation for training.
Abstract
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code…
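To make the three-stage decomposition concrete, here is a minimal sketch of a grounding, planning, and generation pipeline. It is not the authors' implementation. Every name in it (`Region`, `ground`, `plan`, `generate`, `screen_to_html`) is hypothetical, the grounding step returns hard-coded boxes where the paper would use a vision-language model, and the planner's row-grouping heuristic is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A labelled bounding box (hypothetical structure, not from the paper)."""
    label: str
    x: int
    y: int
    w: int
    h: int


def ground(image_size: Tuple[int, int]) -> List[Region]:
    """Grounding stage stand-in: returns fixed boxes instead of querying a model."""
    width, height = image_size
    return [
        Region("main", 200, 80, width - 200, height - 80),
        Region("header", 0, 0, width, 80),
        Region("sidebar", 0, 80, 200, height - 80),
    ]


def plan(regions: List[Region]) -> List[List[Region]]:
    """Planning stage stand-in: group boxes that share a top edge into rows,
    ordered top to bottom, then left to right (an assumed heuristic)."""
    rows = {}
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        rows.setdefault(region.y, []).append(region)
    return [rows[y] for y in sorted(rows)]


def generate(layout: List[List[Region]]) -> str:
    """Generation stage stand-in: emit one flex container per planned row."""
    lines = ["<body>"]
    for row in layout:
        lines.append('  <div style="display:flex">')
        for r in row:
            lines.append(
                f'    <div class="{r.label}" '
                f'style="width:{r.w}px;height:{r.h}px"></div>'
            )
        lines.append("  </div>")
    lines.append("</body>")
    return "\n".join(lines)


def screen_to_html(image_size: Tuple[int, int]) -> str:
    """Chain the three stages; each one can be inspected or swapped on its own."""
    return generate(plan(ground(image_size)))


if __name__ == "__main__":
    print(screen_to_html((1000, 600)))
```

Because each stage exposes a plain intermediate value (a list of regions, then a list of rows), a perception error can be told apart from a planning error by inspecting the hand-off between stages, which is the interpretability argument the abstract makes for the modular design.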
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
+ Clear problem framing and modularization. The paper gives a concrete analysis of two failure modes of current MLLMs on UI-to-code (perception vs. planning) and maps them 1:1 to the three agents, which makes the overall story quite interpretable and also explains why “one big model” often fails in practice. The method section is readable and the pipeline could plausibly be reimplemented.
+ Stronger evaluation setting. Introducing ScreenBench (1k, more contemporary, more structurally complex) is
- Limited novelty relative to existing agentic image-to-code pipelines. The main idea—splitting the visual-to-code process into grounding, layout planning, and code generation, each handled by an MLLM—is conceptually similar to several recent agentic or divide-and-conquer image-to-code systems [1, 2]. Those works have already argued that monolithic MLLMs conflate perception and layout reasoning and proposed staged workflows for webpage reconstruction. The contribution currently appears to rest o
- New dataset and benchmark: The paper presents Screen-10K (curated from 50k webpages into 10k clean pairs to stabilize training) and ScreenBench (1,000 contemporary websites emphasizing complex nested layouts).
- The paper is generally well written.
- There are already some datasets on UI code generation that the paper does not discuss or compare against. For example: Gui et al., VISION2UI: A Real-World Dataset with Layout for Code Generation from UI Designs, https://arxiv.org/abs/2404.06369v1, April 2024. Hugo Laurençon, Léo Tronchon, and Victor Sanh. Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. 2024. URL https://api.semanticscholar.org/CorpusID:268385510. Especially, the VISION2UI dataset is also e
1. The work seems to be well motivated and built up from the basics, which is good to see.
2. The results look promising both qualitatively and quantitatively.
3. The paper is well written and easy to follow.
1. I do not see a discussion of competing work related to ScreenBench.
2. Some previous instances and citations in Section 3 would be useful to support the claims.
3. Some analysis of why a given metric is good or bad would be useful.
4. What about some failure case analysis?
* Breaking the problem into perceptual (vision) and logical (planning, coding) stages is a sensible and interpretable approach.
* The method achieves high performance across multiple metrics and shows clear gains over both open baselines and earlier systems.
* By generating image-code pairs and a curated test benchmark, the authors contribute valuable datasets.
* From a research perspective, ScreenCoder is a blend of existing techniques rather than a fundamentally novel invention. DCGen, LayoutCoder and UICopilot have all used hierarchical generation heuristics; the main difference is leveraging a pretrained VLM for component detection and adding RL fine-tuning. While these yield better results, the conceptual novelty is limited.
* The Planning Agent relies on fixed rules and “front-end engineering priors” (hardcoded grid templates, etc.). These heurist
