Learning UI-to-Code Reverse Generator Using Visual Critic Without Rendering
Davit Soselia, Khalid Saifullah, and Tianyi Zhou

TL;DR
This paper introduces ViCT, a vision-code transformer that generates HTML/CSS code from UI screenshots. It is fine-tuned with a visual critic that estimates visual discrepancy without rendering the generated code, and it outperforms the baseline on synthetic datasets.
Contribution
The paper presents ViCT, a vision-code transformer fine-tuned with a visual critic that aligns the vision and code modalities without rendering, improving UI-to-code generation accuracy and efficiency over existing methods.
Findings
ViCT achieves a higher IoU (0.79) than the baseline (0.64).
ViCT reduces MSE from 12.25 to 9.02.
ViCT maintains performance with lower computational cost.
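For readers unfamiliar with the two metrics above, the following is a minimal sketch of how IoU and MSE between a target screenshot and a rendered prediction could be computed. It assumes images are given as equally sized nested lists (binary masks for IoU, pixel intensities for MSE); the paper's exact metric definitions and preprocessing are not stated in this summary, so the function names and input format here are illustrative assumptions.

```python
def iou(mask_a, mask_b):
    """Intersection over union of two equally sized binary masks (lists of rows).

    Assumption: a truthy cell means "this pixel belongs to a UI element".
    """
    inter = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                inter += 1
            if a or b:
                union += 1
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def mse(img_a, img_b):
    """Mean squared error between two equally sized images (lists of rows)."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count
```

Higher IoU and lower MSE both indicate that the rendered output is closer to the original screenshot, which is the direction of the improvements reported above.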
Abstract
Automated reverse engineering of HTML/CSS code from UI screenshots is an important yet challenging problem with broad applications in website development and design. In this paper, we propose a novel vision-code transformer (ViCT) composed of a vision encoder processing the screenshots and a language decoder to generate the code. They are initialized by pre-trained models such as ViT/DiT and GPT-2/LLaMA, but aligning the two modalities requires end-to-end finetuning, which aims to minimize the visual discrepancy between the code-rendered webpage and the original screenshot. However, the rendering is non-differentiable and causes costly overhead. We address this problem by actor-critic fine-tuning where a visual critic without rendering (ViCR) is developed to predict visual discrepancy given the original and generated code. To train and evaluate our models, we created two synthetic…
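The idea of fine-tuning against a critic that scores code pairs directly, rather than rendering them, can be illustrated with a toy policy-gradient loop. This is only a sketch under strong assumptions: the "actor" is a softmax over a handful of fixed candidate snippets rather than a transformer decoder, and `stand_in_critic` is a hypothetical hand-written token-overlap score, not the learned ViCR model described in the abstract. All names are invented for illustration.

```python
import math
import random


def stand_in_critic(original_code, generated_code):
    """Hypothetical critic: token-level Jaccard distance in [0, 1].

    The paper's ViCR is a learned model that predicts visual discrepancy;
    this placeholder only mimics its interface (two code strings in, score out).
    """
    a, b = set(original_code.split()), set(generated_code.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def fine_tune(original_code, candidates, steps=500, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline over a toy categorical actor."""
    rng = random.Random(seed)
    logits = [0.0] * len(candidates)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        # The actor samples a candidate; no rendering happens at any point.
        i = rng.choices(range(len(candidates)), weights=probs)[0]
        # Reward is the negated discrepancy predicted by the critic.
        reward = -stand_in_critic(original_code, candidates[i])
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # Gradient of log softmax probability with respect to each logit.
        for j in range(len(logits)):
            grad = (1.0 if j == i else 0.0) - probs[j]
            logits[j] += lr * advantage * grad
    return softmax(logits)
```

Because the reward comes from a function of the two code strings, every update step avoids the non-differentiable and costly rendering call, which is the motivation the abstract gives for the critic.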
Taxonomy
Topics: Software Engineering Research · Online Learning and Analytics
Methods: Multi-Head Attention · Absolute Position Encodings · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Attention Is All You Need · Linear Layer · Label Smoothing · Adam
