Harnessing Webpage UIs for Text-Rich Visual Understanding
Junpeng Liu, Tianyue Ou, Yifan Song, Yuxiao Qu, Wai Lam, Chenyan Xiong, Wenhu Chen, Graham Neubig, Xiang Yue

TL;DR
This paper introduces MultiUI, a large instruction dataset synthesized from webpage UIs, which helps multimodal models improve text-rich visual understanding and generalize across web and non-web tasks.
Contribution
The paper presents MultiUI, a new large-scale dataset derived from webpage UIs, and demonstrates its effectiveness in training models for diverse multimodal tasks and domains.
Findings
Models trained on MultiUI achieve up to 48% improvement on VisualWebBench.
Models trained on MultiUI also show a 19.1% increase in element accuracy on the Mind2Web dataset.
Models generalize well to non-web UI and non-UI tasks like document understanding and OCR.
Abstract
Text-rich visual understanding, the ability to process environments where dense textual content is integrated with visuals, is crucial for multimodal large language models (MLLMs) to interact effectively with structured environments. To enhance this capability, we propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). Despite lacking direct visual input, text-based LLMs are able to process structured text representations from webpage accessibility trees. These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts. Models trained on MultiUI not only excel in web UI tasks, achieving up to a 48% improvement on VisualWebBench and a 19.1% boost in element accuracy on a web…
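The pipeline described in the abstract (a text-only LLM reads an accessibility tree, writes an instruction, and the result is paired with a screenshot) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the nested-dict tree format, the prompt wording, the `Sample` fields, and the `fake_llm` stand-in are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """One training example: a screenshot paired with a text instruction."""
    screenshot_path: str
    instruction: str
    response: str


def flatten_axtree(node: dict, depth: int = 0) -> list[str]:
    """Render a nested accessibility-tree node as indented 'role "name"' lines.

    The dict layout (role / name / children) is a hypothetical simplification.
    """
    lines = [f'{"  " * depth}{node["role"]} "{node.get("name", "")}"']
    for child in node.get("children", []):
        lines.extend(flatten_axtree(child, depth + 1))
    return lines


def build_prompt(axtree_text: str) -> str:
    """Wrap the flattened tree in an (assumed) question-generation prompt."""
    return (
        "Below is the accessibility tree of a webpage.\n"
        f"{axtree_text}\n"
        "Write one question about the page and its answer as 'Q: ...' then 'A: ...'."
    )


def synthesize_sample(axtree: dict, screenshot_path: str,
                      llm: Callable[[str], str]) -> Sample:
    """Ask a text-only LLM for a Q/A pair, then attach the screenshot."""
    reply = llm(build_prompt("\n".join(flatten_axtree(axtree))))
    question, _, answer = reply.partition("\nA:")
    return Sample(screenshot_path, question.removeprefix("Q:").strip(), answer.strip())


# Hypothetical page and a canned reply standing in for a real LLM call.
tree = {
    "role": "main",
    "name": "Login",
    "children": [
        {"role": "textbox", "name": "Email"},
        {"role": "button", "name": "Sign in"},
    ],
}


def fake_llm(prompt: str) -> str:
    return "Q: What is the label of the button?\nA: Sign in"


sample = synthesize_sample(tree, "screenshots/login.png", fake_llm)
print(sample)
```

The LLM never sees the image in this sketch; only the multimodal model trained on the resulting `Sample` records would consume the screenshot, which mirrors the division of labor the abstract describes.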
Peer Reviews
Decision: ICLR 2025 Poster
- The paper is easy to follow and provides substantial figures that aid understanding.
- The methodology for defining the data collection pipeline seems valid and the qualitative examples show that the samples are relevant.
- The proposed dataset provides substantial annotations relevant for the UI understanding field. It is a large-scale dataset, and can be relevant for the VLM community.
- Experimental results of training LLaVA-based VLMs on the proposed data show that models perform

- The paper lacks a comprehensive comparison with other datasets in the field, in terms of number of samples, and types of annotations and tasks they can perform. Some examples are SeeClick, Ferret-UI, Mind2Web, or WebArena, among others.
- The selection of baselines seems limited, mostly relying on LLaVA. Most of the results on other baselines and architectures appear empty (referring to tables 2 and 3). Authors need to thoroughly evaluate closed models like GPT4o or Claude 3.5 in the proposed

1. The introduction of MultiUI as a large, diverse dataset specifically designed to enhance multimodal understanding using structured web UI data.
2. The paper demonstrates the dataset's impact, showing substantial performance gains over existing baselines across various multimodal tasks, emphasizing its importance for text-rich visual understanding.

1. There's a lack of documentation, making the dataset hard to navigate and providing no usage guide for the code. The authors should improve accessibility and include a clear code guide.
2. In the construction pipeline of MultiUI, while the use of Llama and LLaVA may be reasonable for certain tasks, relying on them for higher-level tasks like question answering raises concerns about reliability. Using these models for complex tasks in the dataset risks introducing biases and undermining the ro

* The collected dataset appears relevant to the burgeoning field of LLM-based UI understanding and control.
* The method introduced to collect the dataset is described clearly and the identified sub-tasks are relevant and well motivated.
* The models fine-tuned on the introduced data achieve strong results across benchmarks, which supports the utility of the data collected.

* The description of the grounding task data generation (2.3.3) could be more specific. The description of how Llama 3 is used is not very clear to me ("[the model] is not only prompted to generate multiple grounding instructions to predict the bounding box of a given element but also provides the corresponding ground-truth bounding boxes"). For element grounding the extraction from the DOM tree is not described. Is the "element description" simply the text corresponding to an element (as per fig
Taxonomy
Topics: Video Analysis and Summarization
