ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, Abhanshu Sharma

TL;DR
ScreenAI is a vision-language model designed for understanding user interfaces and infographics, leveraging novel training strategies and datasets to achieve state-of-the-art performance on multiple UI and infographic tasks.
Contribution
The paper introduces ScreenAI, a specialized vision-language model with a novel screen annotation task and scalable dataset generation, improving UI and infographic understanding.
Findings
Achieves state-of-the-art results on multiple UI and infographic tasks
Uses a novel screen annotation task for training
Releases three new datasets for UI understanding
Abstract
Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI-…
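The abstract mentions the flexible patching strategy of pix2struct without spelling it out. As a rough illustration only, the sketch below picks a patch grid that roughly preserves the screenshot's aspect ratio while staying within a fixed patch budget. The function name, the patch size of 16, and the budget of 1024 patches are assumptions chosen for the example, not values taken from this page, and the sketch is not ScreenAI's actual implementation.

```python
import math


def flexible_patch_grid(image_height, image_width, patch_size=16, max_patches=1024):
    """Choose a (rows, cols) patch grid that roughly preserves aspect ratio.

    Hypothetical helper: patch_size and max_patches are illustrative defaults.
    """
    # Scale factor chosen so that rows * cols stays within the patch budget.
    scale = math.sqrt(
        max_patches * (patch_size / image_height) * (patch_size / image_width)
    )
    # Round down to whole patches and keep at least one patch per axis.
    rows = max(min(math.floor(scale * image_height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * image_width / patch_size), max_patches), 1)
    return rows, cols


# A portrait phone screenshot and the same screenshot rotated to landscape.
portrait = flexible_patch_grid(1920, 1080)
landscape = flexible_patch_grid(1080, 1920)
print(portrait, landscape)
```

Under this scheme the image would then be resized to rows * patch_size by cols * patch_size pixels before being cut into patches, so a tall screenshot gets more rows than columns instead of being squashed into a fixed square grid.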
Taxonomy
Topics: Innovative Human-Technology Interaction · Persona Design and Applications
Methods: Activation Patching
