DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation
Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou, Junzhuo Liu, Junhao Ji, Zulong Chen

TL;DR
DOne is a novel framework that separates structure understanding from element rendering to improve high-fidelity design-to-code generation, addressing limitations of existing Vision Language Models.
Contribution
It introduces a decoupled approach with a layout segmentation module, a hybrid element retriever, and a schema-guided generation paradigm, along with a new benchmark HiFi2Code.
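As a rough illustration of what "schema-guided" generation could mean, the sketch below renders a layout tree into nested HTML. It is not the authors' implementation: the `Node` schema, the `render` function, and the absolute-positioning strategy are all hypothetical choices made for this example. The only assumption taken from the summary is that layout structure is decided first and element content is filled in afterwards.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical layout-schema node: a tag, a pixel box, optional text."""
    tag: str
    box: Tuple[int, int, int, int]  # (x, y, width, height) in page coordinates
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def render(node: Node, parent_xy: Tuple[int, int] = (0, 0), indent: int = 0) -> List[str]:
    """Emit HTML lines for a node, positioning it relative to its parent."""
    x, y, w, h = node.box
    px, py = parent_xy
    style = (
        f"position:absolute;left:{x - px}px;top:{y - py}px;"
        f"width:{w}px;height:{h}px"
    )
    pad = "  " * indent
    lines = [f'{pad}<{node.tag} style="{style}">']
    if node.text:
        # Escape text so element content cannot break the markup.
        lines.append(f"{pad}  {escape(node.text)}")
    for child in node.children:
        lines.extend(render(child, (x, y), indent + 1))
    lines.append(f"{pad}</{node.tag}>")
    return lines


# Structure (the tree) is fixed first; element content is attached to leaves.
page = Node("div", (0, 0, 800, 600), children=[
    Node("header", (0, 0, 800, 80), children=[
        Node("span", (20, 30, 100, 20), "Logo & Co"),
    ]),
    Node("section", (0, 80, 800, 520), children=[
        Node("button", (700, 100, 80, 40), "Buy"),
    ]),
])
html = "\n".join(render(page))
```

Keeping the tree separate from the leaf content means a layout error and a rendering error can be inspected independently, which is the intuition behind decoupling the two stages.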
Findings
DOne outperforms existing methods in visual similarity and element alignment.
It achieves an improvement of over 10% in GPT Score.
Human evaluations show a 3x productivity gain.
Abstract
While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck," failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on the…
