UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity
Yicheng Fu, Raviteja Anantha, Prabal Vashisht, Jianpeng Cheng, Etai Littwin

TL;DR
UI-JEPA introduces a self-supervised framework for understanding user intent from UI actions, achieving high accuracy with significantly lower computational resources compared to large multimodal models.
Contribution
The paper presents UI-JEPA, a lightweight, self-supervised learning approach with new datasets, enabling efficient user intent prediction comparable to large models but with fewer resources.
Findings
UI-JEPA matches state-of-the-art large MLLMs in intent prediction.
It cuts computational cost by a factor of more than 50 relative to those MLLMs.
It outperforms GPT-4 Turbo and Claude 3.5 Sonnet in intent similarity scores.
Abstract
Generating user intent from a sequence of user interface (UI) actions is a core challenge in comprehensive UI understanding. Recent advancements in multimodal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent prediction. We also introduce two new UI-grounded multimodal datasets, "Intent in the Wild" (IIW) and "Intent in…
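The masking idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the excerpt does not specify the masking strategies, encoder, or predictor, so everything below is an assumption made for illustration. The function names (`temporal_block_mask`, `masked_prediction_loss`), the contiguous temporal block mask, and the mean-of-visible-frames "predictor" are all hypothetical stand-ins. The one point it aims to show is that a JEPA-style objective compares predictions and targets in embedding space rather than reconstructing pixels.

```python
import random


def temporal_block_mask(num_frames, mask_ratio, rng):
    """Return a list of booleans; True marks frames hidden from the context.

    Hypothetical strategy: hide one contiguous block of frames.
    """
    # Keep at least one masked frame and at least one visible frame.
    num_masked = min(num_frames - 1, max(1, round(num_frames * mask_ratio)))
    start = rng.randrange(num_frames - num_masked + 1)
    return [start <= i < start + num_masked for i in range(num_frames)]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def masked_prediction_loss(embeddings, mask):
    """Mean squared error between a predicted embedding and each masked one.

    The "predictor" here is just the mean of the visible embeddings, a
    stand-in for a learned network (assumption, not the paper's design).
    """
    context = [e for e, hidden in zip(embeddings, mask) if not hidden]
    targets = [e for e, hidden in zip(embeddings, mask) if hidden]
    prediction = mean_vector(context)
    errors = [
        (p - t) ** 2
        for target in targets
        for p, t in zip(prediction, target)
    ]
    return sum(errors) / len(errors)


# Toy data: 8 "frame embeddings" of dimension 4 (random values, not real UI data).
rng = random.Random(0)
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
mask = temporal_block_mask(len(embeddings), 0.5, rng)
loss = masked_prediction_loss(embeddings, mask)
```

In a real system the mean would be replaced by trained context and target encoders plus a predictor network, and the loss would drive their parameters; the sketch only fixes the shape of the objective.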
Taxonomy
Topics: Innovative Human-Technology Interaction · Data Visualization and Analytics · Virtual Reality Applications and Impacts
Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Layer Normalization · Dropout · Position-Wise Feed-Forward Layer · Residual Connection · Linear Layer
