Android in the Zoo: Chain-of-Action-Thought for GUI Agents

Jiwen Zhang; Jihao Wu; Yihua Teng; Minghui Liao; Nuo Xu; Xiao Xiao; Zhongyu Wei; Duyu Tang

arXiv:2403.02713 · cs.CL · July 16, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces Chain-of-Action-Thought (CoAT), a novel approach for improving GUI agent action prediction by incorporating semantic information from action descriptions and visual observations, supported by a new dataset.

Contribution

The work presents CoAT, a new method that enhances action prediction in GUI agents using semantic reasoning, and provides the AitZ dataset for future research.

Findings

01. CoAT significantly improves zero-shot action prediction over previously proposed context modeling on three off-the-shelf LMMs.

02. Fine-tuning a 1B model on AitZ achieves performance comparable to larger models.

03. The AitZ dataset contains 18,643 annotated screen-action pairs for training and evaluation.

Abstract

Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete a task triggered by natural language by predicting a sequence of API actions. Even though the task relies heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and, more importantly, the action thinking about what actions should be performed and the outcomes led by the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a…
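
As a rough illustration of the components the abstract lists (previous action descriptions, the current screen, action thinking, and the expected outcome), the following minimal Python sketch assembles them into a single text prompt. It is not the authors' implementation: the `CoATStep` class, its field names, the `build_prompt` helper, and the prompt layout are all hypothetical, chosen only to show how such a context might be organized.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CoATStep:
    """One step of a hypothetical CoAT-style context (names are illustrative)."""
    screen_description: str  # textual description of the current screen
    previous_actions: List[str] = field(default_factory=list)  # earlier action descriptions
    action_thinking: str = ""   # reasoning about which action to perform next
    expected_outcome: str = ""  # anticipated result of the chosen action


def build_prompt(task: str, step: CoATStep) -> str:
    """Join the available components into one prompt string for an LMM."""
    # Number the action history, or mark it as empty on the first step.
    history = "\n".join(
        f"{i}. {a}" for i, a in enumerate(step.previous_actions, 1)
    ) or "(none)"
    parts = [
        f"Task: {task}",
        f"Previous actions:\n{history}",
        f"Current screen: {step.screen_description}",
    ]
    # Optional components are included only when they are provided.
    if step.action_thinking:
        parts.append(f"Action thinking: {step.action_thinking}")
    if step.expected_outcome:
        parts.append(f"Expected outcome: {step.expected_outcome}")
    parts.append("Next action:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    step = CoATStep(
        screen_description="Home screen with app icons",
        previous_actions=["Open Settings"],
        action_thinking="Wi-Fi options are usually under Network settings.",
    )
    print(build_prompt("Turn on Wi-Fi", step))
```

Keeping each component as a separate field makes it easy to drop one of them and compare prompts, which is one plausible way to study how much each piece of context contributes.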

Code & Models

Repositories

imnearth/coat (Official)

Taxonomy

Topics: Social Robot Interaction and HRI · Reinforcement Learning in Robotics · AI in Service Interactions