ClickAgent: Enhancing UI Location Capabilities of Autonomous Agents

Jakub Hoscilowicz; Bartosz Maj; Bartosz Kozakiewicz; Oleksii Tymoshchuk; Artur Janicki

arXiv:2410.11872 · cs.HC · October 18, 2024

TL;DR

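ClickAgent is a framework that pairs a multimodal large language model (MLLM) for reasoning and action planning with a separate UI location model. It targets automation of GUI tasks on mobile devices and addresses a key limitation of current MLLMs: their difficulty in accurately locating UI elements.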

Contribution

The paper introduces a hybrid approach that separates reasoning (handled by the MLLM) from UI element localization (handled by a dedicated location model), improving autonomous agent performance in GUI interactions.

Findings

01 Outperforms CogAgent and AppAgent on the AITW benchmark

02 Achieves higher task success rates on Android devices

03 Effectively combines LLM reasoning with UI element detection

Abstract

With the growing reliance on digital devices equipped with graphical user interfaces (GUIs), such as computers and smartphones, the need for effective automation tools has become increasingly important. While multimodal large language models (MLLMs) like GPT-4V excel in many areas, they struggle with GUI interactions, limiting their effectiveness in automating everyday tasks. In this paper, we introduce ClickAgent, a novel framework for building autonomous agents. In ClickAgent, the MLLM handles reasoning and action planning, while a separate UI location model (e.g., SeeClick) identifies the relevant UI elements on the screen. This approach addresses a key limitation of current-generation MLLMs: their difficulty in accurately locating UI elements. ClickAgent outperforms other prompt-based autonomous agents (CogAgent, AppAgent) on the AITW benchmark. Our evaluation was conducted on both…
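The division of labor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Action` type, the `run_episode` function, the callable signatures, and the normalized-coordinate output of the locator are all assumptions made for illustration. The planner stands in for the MLLM (reasoning and action planning), and the locator stands in for a UI location model such as SeeClick.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical action format produced by the planner."""
    kind: str                      # "click", "type", or "done"
    target: Optional[str] = None   # natural-language description of a UI element
    text: Optional[str] = None     # text to enter for "type" actions


# Hypothetical interfaces: the planner wraps an MLLM, the locator wraps a
# UI location model. Neither signature comes from the paper.
Planner = Callable[[str, bytes, List[Action]], Action]
Locator = Callable[[bytes, str], Tuple[float, float]]


def run_episode(
    task: str,
    get_screenshot: Callable[[], bytes],
    planner: Planner,
    locator: Locator,
    execute: Callable[[Action, Optional[Tuple[float, float]]], None],
    max_steps: int = 10,
) -> List[Action]:
    """Alternate planning and localization until the planner reports 'done'."""
    history: List[Action] = []
    for _ in range(max_steps):
        screen = get_screenshot()
        # Step 1: the planner decides WHAT to do next.
        action = planner(task, screen, history)
        if action.kind == "done":
            break
        # Step 2: the locator decides WHERE on the screen to do it.
        point = None
        if action.kind == "click" and action.target is not None:
            point = locator(screen, action.target)
        execute(action, point)
        history.append(action)
    return history


def demo() -> List[Tuple[str, Optional[Tuple[float, float]]]]:
    """Run the loop with stub components instead of real models."""
    script = iter([
        Action("click", target="Search bar"),
        Action("type", text="weather"),
        Action("done"),
    ])
    executed: List[Tuple[str, Optional[Tuple[float, float]]]] = []
    run_episode(
        task="Search for weather",
        get_screenshot=lambda: b"",                       # placeholder screenshot
        planner=lambda task, screen, hist: next(script),  # scripted "MLLM"
        locator=lambda screen, target: (0.5, 0.1),        # fixed "location model"
        execute=lambda action, point: executed.append((action.kind, point)),
    )
    return executed
```

Keeping the planner and the locator behind separate callables mirrors the paper's premise that locating UI elements is a distinct capability from reasoning, so either component could be swapped without changing the loop.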

Code & Models

Repositories

Samsung/ClickAgent (PyTorch, official)

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Mobile Agent-Based Network Management · Web Data Mining and Analysis