Reinforced UI Instruction Grounding: Towards a Generic UI Task Automation API
Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, Yan Lu

TL;DR
This paper introduces a multimodal, metadata-free model that grounds natural language instructions in UI screenshots for automation, leveraging pretrained document understanding and reinforcement learning to enhance spatial decoding accuracy.
Contribution
The work presents a novel reinforced pixel-to-sequence model for UI instruction grounding, improving spatial decoding and demonstrating potential as a universal UI automation API.
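To make "pixel-to-sequence" concrete: one common way to let a language decoder emit locations is to quantize box coordinates into a small vocabulary of discrete tokens. The sketch below shows that idea only. The bin count, the function names, and the (x0, y0, x1, y1) box layout are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels; assumed layout

NUM_BINS = 1000  # hypothetical size of the coordinate-token vocabulary


def box_to_tokens(box: Box, width: int, height: int, num_bins: int = NUM_BINS) -> List[int]:
    """Quantize a pixel-space box into four integer coordinate tokens."""
    def quantize(value: float, size: int) -> int:
        # Clamp to the screenshot, then map [0, size] onto bins 0..num_bins-1.
        value = min(max(float(value), 0.0), float(size))
        return min(int(value * num_bins / size), num_bins - 1)

    x0, y0, x1, y1 = box
    return [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]


def tokens_to_box(tokens: List[int], width: int, height: int, num_bins: int = NUM_BINS) -> Box:
    """Map four coordinate tokens back to pixels, using the centre of each bin."""
    def dequantize(token: int, size: int) -> float:
        return (token + 0.5) / num_bins * size

    x0, y0, x1, y1 = tokens
    return (dequantize(x0, width), dequantize(y0, height),
            dequantize(x1, width), dequantize(y1, height))
```

Under these assumptions, a 1000x500 screenshot with a target box of (100, 50, 300, 150) becomes the token sequence [100, 100, 300, 300], and decoding recovers each coordinate to within one bin width.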
Findings
Outperforms state-of-the-art methods in UI instruction grounding
Reinforcement learning improves spatial decoding accuracy
Shows promise as a generic API for UI task automation
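The reinforcement learning finding above can be illustrated with a small sketch. Token-level training scores each coordinate token on its own, whereas a sequence-level reward scores the whole decoded box at once. The example assumes an intersection-over-union (IoU) reward and a plain REINFORCE-style objective. This summary does not state the reward or the optimizer, so both are assumptions, and the function names are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); assumed layout


def iou(pred: Box, target: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], target[2]) - max(pred[0], target[0]))
    inter_h = max(0.0, min(pred[3], target[3]) - max(pred[1], target[1]))
    inter = inter_w * inter_h
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter
    return inter / union if union > 0 else 0.0


def reinforce_loss(token_log_probs: Sequence[float], reward: float, baseline: float = 0.0) -> float:
    """REINFORCE-style loss for one sampled box (assumed objective).

    Minimizing this raises the log-probability of sampled token sequences
    whose reward exceeds the baseline and lowers it for those below.
    """
    return -(reward - baseline) * sum(token_log_probs)
```

In a real training loop the log-probabilities would come from the decoder and the loss would be backpropagated through them; here they are plain floats so the arithmetic stays visible.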
Abstract
The recent popularity of Large Language Models (LLMs) has opened countless possibilities for automating numerous AI tasks by connecting LLMs to various domain-specific models or APIs, where LLMs serve as dispatchers while domain-specific models or APIs are action executors. Despite the vast number of domain-specific models and APIs, they still struggle to comprehensively cover the highly diverse automation demands in the interaction between humans and User Interfaces (UIs). In this work, we build a multimodal model to ground natural language instructions in given UI screenshots as a generic UI task automation executor. This metadata-free grounding model, consisting of a visual encoder and a language decoder, is first pretrained on well-studied document understanding tasks and then learns to decode spatial information from UI screenshots in a promptable way. To facilitate the exploitation of…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
