Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API

Zhizheng Zhang; Wenxuan Xie; Xiaoyi Zhang; Yan Lu

arXiv:2310.04716 · cs.CV · October 10, 2023 · 2 cites

TL;DR

This paper introduces a multimodal, metadata-free model that grounds natural language instructions in UI screenshots for automation. The model is pretrained on document understanding tasks and uses reinforcement learning to improve spatial decoding accuracy.

Contribution

The work presents a novel reinforced pixel-to-sequence model for UI instruction grounding, improving spatial decoding and demonstrating potential as a universal UI automation API.
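To make "pixel-to-sequence" and "spatial decoding" concrete, here is a minimal sketch of one common way such models represent a target region: the bounding box is quantized into a short sequence of discrete coordinate tokens that a language decoder can emit, and a decoded box can be scored against the ground truth by intersection-over-union (IoU), a natural candidate for a reinforcement learning reward. This page does not state the paper's tokenization or reward, so the bin count `NUM_BINS`, the function names, and the use of IoU as the score are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

# (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]

NUM_BINS = 1000  # assumed number of coordinate tokens; not taken from the paper


def box_to_tokens(box: Box, width: int, height: int, num_bins: int = NUM_BINS) -> List[int]:
    """Quantize a pixel-space box into four integer coordinate tokens."""
    def quantize(value: float, size: int) -> int:
        ratio = min(max(value / size, 0.0), 1.0)  # clamp to the image
        return min(int(ratio * num_bins), num_bins - 1)

    x0, y0, x1, y1 = box
    return [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]


def tokens_to_box(tokens: List[int], width: int, height: int, num_bins: int = NUM_BINS) -> Box:
    """Map four coordinate tokens back to pixel space using bin centers."""
    def dequantize(token: int, size: int) -> float:
        return (token + 0.5) / num_bins * size

    tx0, ty0, tx1, ty1 = tokens
    return (dequantize(tx0, width), dequantize(ty0, height),
            dequantize(tx1, width), dequantize(ty1, height))


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    target: Box = (120.0, 340.0, 260.0, 380.0)  # hypothetical button location
    tokens = box_to_tokens(target, width=1920, height=1080)
    decoded = tokens_to_box(tokens, width=1920, height=1080)
    print(tokens, decoded, round(iou(target, decoded), 3))
```

A sequence-level score such as IoU depends on all four tokens jointly, which is one reason a reward computed on the whole decoded box can complement per-token training; whether the paper uses exactly this signal is not stated here.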

Findings

01 Outperforms state-of-the-art methods in UI instruction grounding

02 Effective reinforcement learning enhances spatial decoding accuracy

03 Shows promise as a generic API for UI task automation

Abstract

The recent popularity of Large Language Models (LLMs) has opened countless possibilities in automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast number of domain-specific models and APIs, they still struggle to comprehensively cover the highly diverse automation demands in the interaction between humans and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well-studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling