Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset
Hugo Laurençon, Léo Tronchon, Victor Sanh

TL;DR
This paper introduces WebSight, a large synthetic dataset of 2 million webpage screenshot-HTML pairs, enabling fine-tuning of vision-language models to convert screenshots into HTML code, advancing no-code web development.
Contribution
The paper presents WebSight, the first high-quality, large-scale dataset for training models to convert web screenshots into HTML, and demonstrates fine-tuning a VLM for this task.
Findings
VLMs can be effectively fine-tuned on WebSight to generate HTML from screenshots.
WebSight is intended to accelerate research in screenshot-to-HTML conversion.
Open-sourcing WebSight supports further advancements in no-code web development.
Abstract
Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.
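As a rough illustration of what a screenshot-HTML pair might look like when prepared for supervised fine-tuning, here is a minimal sketch using only the Python standard library. The class name, field names, file path, and prompt string are all hypothetical, chosen for illustration; this is not the authors' data format or training pipeline.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import Dict, List


@dataclass(frozen=True)
class ScreenshotHtmlPair:
    """One dataset record: a rendered screenshot and the HTML that produced it.

    Field names are hypothetical, not taken from the WebSight release.
    """
    screenshot_path: str
    html: str


class _TagCollector(HTMLParser):
    """Collects opening tag names in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def opening_tags(html: str) -> List[str]:
    """Return the opening tags found in `html`, as a cheap structural check."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def to_training_example(
    pair: ScreenshotHtmlPair,
    prompt: str = "Generate the HTML for this screenshot.",  # assumed prompt
) -> Dict[str, str]:
    """Frame a pair as (image, prompt) -> target HTML for supervised training."""
    return {"image": pair.screenshot_path, "prompt": prompt, "target": pair.html}


pair = ScreenshotHtmlPair(
    screenshot_path="page_0001.png",  # placeholder path
    html="<html><body><h1>Hello</h1></body></html>",
)
example = to_training_example(pair)
```

Here the model input would be the image plus a fixed instruction, and the training target is the HTML source; `opening_tags` is only a lightweight way to inspect the structure of a target or of a generated page.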
Taxonomy
Topics: Web Data Mining and Analysis · Mobile and Web Applications
