Lexi: Self-Supervised Learning of the UI Language

Pratyay Banerjee; Shweti Mahajan; Kushal Arora; Chitta Baral; Oriana Riva

arXiv:2301.10165 · cs.CL · January 25, 2023

TL;DR

Lexi is a self-supervised vision-language model trained on a large UI dataset, enabling better understanding of UI screens for applications like accessibility and automation without relying on UI metadata.

Contribution

We introduce Lexi, a novel pre-trained vision-language model for UI understanding that does not depend on UI metadata, trained on a large curated dataset.

Findings

01 Lexi outperforms existing models on UI action entailment.

02 Lexi achieves high accuracy in instruction-based UI image retrieval.

03 Lexi effectively grounds referring expressions and recognizes UI entities.

Abstract

Humans can learn to operate the user interface (UI) of an application by reading an instruction manual or how-to guide. Along with text, these resources include visual content such as UI screenshots and images of application icons referenced in the text. We explore how to leverage this data to learn generic visio-linguistic representations of UI screens and their components. These representations are useful in many real applications, such as accessibility, voice navigation, and task automation. Prior UI representation models rely on UI metadata (UI trees and accessibility labels), which is often missing, incompletely defined, or not accessible. We avoid such a dependency, and propose Lexi, a pre-trained vision and language model designed to handle the unique features of UI screens, including their text richness and context sensitivity. To train Lexi we curate the UICaption dataset…

Code & Models

Repositories

microsoft/uicaption (official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Subtitles and Audiovisual Media · Digital Accessibility for Disabilities