Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, Duyu Tang

TL;DR
This paper introduces Chain-of-Action-Thought (CoAT), a novel approach for improving GUI agent action prediction by incorporating semantic information from action descriptions and visual observations, supported by a new dataset.
Contribution
The work presents CoAT, a new method that enhances action prediction in GUI agents using semantic reasoning, and provides the AitZ dataset for future research.
Findings
In a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction over previously proposed context modeling.
Fine-tuning a 1B-parameter model on AitZ achieves performance comparable to that of larger models.
The AitZ dataset contains 18,643 annotated screen-action pairs for training and evaluation.
Abstract
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete a task triggered by natural language by predicting a sequence of API actions. Even though the task relies heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes the description of the previous actions, the current screen, and, more importantly, the action thinking about what actions should be performed and the outcomes led by the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a…
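The context that the abstract describes (previous actions, the current screen, action thinking, and action outcomes) can be illustrated with a minimal sketch. This is not the authors' implementation: the class `CoATStep`, the function `build_coat_prompt`, the field names, and the prompt wording are all hypothetical, chosen only to show how such a context might be assembled into a text prompt for an off-the-shelf model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CoATStep:
    """One completed step of an episode (hypothetical field names)."""
    action_description: str  # natural-language description of the action taken
    action_result: str       # outcome led by the chosen action


def build_coat_prompt(goal: str, history: List[CoATStep], screen_description: str) -> str:
    """Assemble a CoAT-style text context; the wording is an assumption."""
    lines = [f"Goal: {goal}", "Previous actions:"]
    if not history:
        lines.append("  (none)")
    for i, step in enumerate(history, start=1):
        # Pair each past action with the outcome it produced.
        lines.append(f"  {i}. {step.action_description} -> {step.action_result}")
    lines.append(f"Current screen: {screen_description}")
    # Ask for the action thinking before the action and its expected outcome.
    lines.append(
        "Think about which action should be performed next, "
        "then state the action and its expected outcome."
    )
    return "\n".join(lines)


history = [CoATStep("Tap the Settings icon", "The Settings app opens")]
prompt = build_coat_prompt(
    "Turn on Wi-Fi",
    history,
    "Settings main page with a 'Network & internet' entry",
)
print(prompt)
```

The sketch keeps the history as text rather than raw screenshots, which reflects the abstract's emphasis on semantic information; how the screen description is actually produced is outside the scope of this example.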
Taxonomy
Topics: Social Robot Interaction and HRI · Reinforcement Learning in Robotics · AI in Service Interactions
