Lexi: Self-Supervised Learning of the UI Language
Pratyay Banerjee, Shweti Mahajan, Kushal Arora, Chitta Baral, Oriana Riva

TL;DR
Lexi is a self-supervised vision-language model trained on a large UI dataset, enabling better understanding of UI screens for applications like accessibility and automation without relying on UI metadata.
Contribution
We introduce Lexi, a pre-trained vision-language model for UI understanding that is trained on the curated UICaption dataset and does not depend on UI metadata.
Findings
Lexi outperforms existing models on UI action entailment.
Lexi achieves high accuracy in instruction-based UI image retrieval.
Lexi effectively grounds referring expressions and recognizes UI entities.
Abstract
Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations are useful in many real applications, such as accessibility, voice navigation, and task automation. Prior UI representation models rely on UI metadata (UI trees and accessibility labels), which is often missing, incompletely defined, or not accessible. We avoid such a dependency, and propose Lexi, a pre-trained vision and language model designed to handle the unique features of UI screens, including their text richness and context sensitivity. To train Lexi we curate the UICaption dataset…
Taxonomy
Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Digital Accessibility for Disabilities
