VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning
Yunpeng Song, Yiheng Bian, Yongtao Tang, Guiyu Ma, Zhongmin Cai

TL;DR
VisionTasker pairs vision-based UI understanding with LLM-driven, step-by-step task planning to automate mobile tasks without relying on view hierarchies, and it outperforms previous methods.
Contribution
The paper presents a two-stage framework that combines vision-based UI interpretation with LLM task planning, removing the dependence on view hierarchies for mobile automation.
Findings
Outperforms previous methods on four datasets
Successfully automates 147 real-world Android tasks
Shows advantages over humans in unfamiliar tasks
Abstract
Mobile task automation is an emerging field that leverages AI to streamline and optimize the execution of routine tasks on mobile devices, thereby enhancing efficiency and productivity. Traditional methods, such as Programming By Demonstration (PBD), are limited due to their dependence on predefined tasks and susceptibility to app updates. Recent advancements have utilized the view hierarchy to collect UI information and employed Large Language Models (LLM) to enhance task automation. However, view hierarchies have accessibility issues and face potential problems like missing object descriptions or misaligned structures. This paper introduces VisionTasker, a two-stage framework combining vision-based UI understanding and LLM task planning, for mobile task automation in a step-by-step manner. VisionTasker firstly converts a UI screenshot into natural language interpretations using a…
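The two-stage, step-by-step loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Action`, `run_task`, `capture`, `describe_ui`, `plan_next`, `execute`) is hypothetical, the action vocabulary is invented for the example, and the vision model and LLM are passed in as plain callables so that no particular library or API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    # Illustrative action vocabulary (an assumption, not from the paper).
    kind: str          # e.g. "tap", "input", or "done"
    target: str = ""   # natural-language description of the UI element


def run_task(
    task: str,
    capture: Callable[[], object],                        # returns a screenshot
    describe_ui: Callable[[object], str],                 # stage 1: screenshot -> text
    plan_next: Callable[[str, str, List[Action]], Action],  # stage 2: LLM planner
    execute: Callable[[Action], None],                    # performs the action on device
    max_steps: int = 20,
) -> List[Action]:
    """Run one task step by step and return the actions that were executed."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()
        # Stage 1: turn the current screen into a natural-language description.
        ui_text = describe_ui(screenshot)
        # Stage 2: ask the planner for the next single step, given the task,
        # the current screen description, and the steps taken so far.
        action = plan_next(task, ui_text, history)
        if action.kind == "done":
            break
        execute(action)
        history.append(action)
    return history
```

Planning one action per screen, rather than a full plan up front, lets each decision use the screen that actually appeared after the previous step. The `max_steps` cap is an added safeguard so that a planner that never reports completion cannot loop forever.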
Taxonomy
Topics: AI in Service Interactions · Context-Aware Activity Recognition Systems · Personal Information Management and User Behavior
