Learning UI Navigation through Demonstrations composed of Macro Actions
Wei Li

TL;DR
This paper presents a framework for training UI navigation agents using demonstrations of macro actions, leveraging simplified state representations and demo augmentation to achieve high success rates across diverse applications.
Contribution
The authors introduce a novel UI navigation framework with a customizable action space and demo augmentation, enabling efficient training with minimal demonstrations and coverage of rare cases.
Findings
Achieved 98.7% success rate on complex UI navigation tasks.
Reduced demonstration requirements through demo augmentation.
Enabled training on diverse apps with limited human input.
Abstract
We have developed a framework to reliably build agents capable of UI navigation. The state space is simplified from raw pixels to a set of UI elements extracted by screen understanding, such as OCR and icon detection. The action space is restricted to the UI elements plus a few global actions. Actions can be customized for tasks, and each action is a sequence of basic operations conditioned on status checks. With such a design, we are able to train DQfD and BC agents with a small number of demonstration episodes. We propose demo augmentation, which significantly reduces the required number of human demonstrations. We customized DQfD to accept demos collected on screenshots, which improves demo coverage of rare cases. Demos are collected only for the cases that failed during evaluation of the previous version of the agent. With tens of iterations looping over evaluation, demo…
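The abstract describes each action as a sequence of basic operations conditioned on status checks, with an action space made of the detected UI elements plus a few global actions. The following is a minimal Python sketch of that idea, not the authors' implementation. All names (`UIElement`, `Step`, `MacroAction`, `build_action_space`, the dictionary-based screen) are hypothetical, and stopping the macro at the first failed check is an assumption; the abstract does not say how a failed check is handled.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for the device state; a real agent would drive a device.
Screen = Dict[str, Any]


@dataclass
class UIElement:
    """A UI element as screen understanding (OCR, icon detection) might report it."""
    label: str
    kind: str  # e.g. "text" or "icon"
    x: int
    y: int


@dataclass
class Step:
    """One basic operation, optionally guarded by a status check."""
    operation: Callable[[Screen], None]
    check: Optional[Callable[[Screen], bool]] = None


@dataclass
class MacroAction:
    """A named sequence of guarded basic operations."""
    name: str
    steps: List[Step]

    def run(self, screen: Screen) -> int:
        """Run steps in order; stop at the first failed check (an assumption).

        Returns the number of operations actually executed.
        """
        executed = 0
        for step in self.steps:
            if step.check is not None and not step.check(screen):
                break
            step.operation(screen)
            executed += 1
        return executed


def tap(element: UIElement) -> Callable[[Screen], None]:
    """Basic operation: tap an element and record it in the screen log."""
    def op(screen: Screen) -> None:
        screen["log"].append(f"tap:{element.label}")
        screen["focused"] = element.label
    return op


def type_text(text: str) -> Callable[[Screen], None]:
    """Basic operation: type text into whatever is focused."""
    def op(screen: Screen) -> None:
        screen["log"].append(f"type:{text}")
    return op


def is_focused(label: str) -> Callable[[Screen], bool]:
    """Status check: is the element with this label currently focused?"""
    return lambda screen: screen.get("focused") == label


def build_action_space(elements: List[UIElement],
                       global_actions: List[MacroAction]) -> List[MacroAction]:
    """Action space = one tap action per detected element, plus global actions."""
    per_element = [MacroAction(f"tap:{e.label}", [Step(tap(e))]) for e in elements]
    return per_element + list(global_actions)
```

As a usage example, a "search" macro could tap a search box and then type only if the box reports focus: `MacroAction("search", [Step(tap(box)), Step(type_text("hello"), check=is_focused("Search"))])`. Because the action list is rebuilt from the elements detected on each screen, its size follows the screen content rather than the pixel grid.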
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Topic Modeling · Multimodal Machine Learning Applications
