UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Yicheng Fu; Raviteja Anantha; Prabal Vashisht; Jianpeng Cheng; Etai Littwin

arXiv:2409.04081·cs.CL·October 3, 2024

TL;DR

UI-JEPA is a self-supervised framework for predicting user intent from sequences of UI actions. It reaches accuracy comparable to large multimodal models while using significantly fewer computational resources.

Contribution

The paper presents UI-JEPA, a lightweight self-supervised approach to user intent prediction, and introduces new UI-grounded datasets. The resulting model predicts user intent about as well as large models while requiring fewer resources.

Findings

01

UI-JEPA matches state-of-the-art large MLLMs in intent prediction.

02

It reduces computational cost by a factor of more than 50.

03

It outperforms GPT-4 Turbo and Claude 3.5 Sonnet in intent similarity scores.

Abstract

Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their extensive parameter counts, heavy computing requirements, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in…
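The masking-based self-supervised objective mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the general JEPA recipe (hide some input tokens, predict their embeddings from the visible ones, and compare against embeddings from a slowly updated target encoder). The function names `sample_mask`, `masked_l1_loss`, and `ema_update`, the L1 distance, and the momentum value are all illustrative assumptions, and the encoders themselves are omitted.

```python
import random


def sample_mask(num_tokens, mask_ratio, rng):
    """Pick token indices to hide from the context encoder (hypothetical helper)."""
    num_masked = max(1, int(num_tokens * mask_ratio))
    return sorted(rng.sample(range(num_tokens), num_masked))


def masked_l1_loss(predicted, target, masked_idx):
    """Mean absolute error between predicted and target embeddings,
    computed only at the masked positions (assumed choice of distance)."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_weights, context_weights, momentum=0.99):
    """Move target-encoder weights slowly towards the context-encoder weights."""
    return [momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_weights, context_weights)]


if __name__ == "__main__":
    rng = random.Random(0)
    masked = sample_mask(num_tokens=8, mask_ratio=0.5, rng=rng)
    # Stand-in embeddings: in a real model these would come from the
    # predictor and the target encoder respectively.
    predicted = [[0.0, 0.0] for _ in range(8)]
    target = [[1.0, 1.0] for _ in range(8)]
    print("masked positions:", masked)
    print("loss:", masked_l1_loss(predicted, target, masked))
```

The loss is taken in embedding space rather than pixel space, which is the distinguishing feature of joint-embedding predictive approaches; how UI-JEPA chooses its masks and distance is specified in the paper itself, not here.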

Taxonomy

Topics: Innovative Human-Technology Interaction · Data Visualization and Analytics · Virtual Reality Applications and Impacts

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer