Chain-of-Memory: Enhancing GUI Agents for Cross-Application Navigation
Xinzge Gao, Chuanrui Hu, Bin Chen, Teng Li

TL;DR
This paper introduces Chain-of-Memory (CoM), a method for explicitly modeling short-term and long-term memory in GUI agents, improving their understanding and performance in complex cross-application tasks.
Contribution
The paper proposes CoM, a novel explicit memory mechanism for GUI agents, and introduces GUI Odyssey-CoM, a large dataset for training and evaluating memory-enhanced GUI agents.
Findings
CoM significantly improves GUI agents' performance in cross-application tasks.
GUI Odyssey-CoM enables smaller models to match larger models' memory capabilities.
Experimental results show enhanced task understanding with explicit memory modeling.
Abstract
Multimodal large language models (MLLMs) are attracting growing attention in the development of Graphical User Interface (GUI) agents. Existing approaches often rely on historical screenshots or actions to implicitly represent the task state. This reliance poses challenges for GUI agents in accurately understanding task states and underscores the absence of effective mechanisms to store critical information in complex and lengthy cross-app tasks. To address these challenges, we propose Chain-of-Memory (CoM), a novel approach for explicitly modeling short-term and long-term memory in GUI agents. CoM achieves this by capturing action descriptions, integrating task-relevant screen information, and maintaining a dedicated memory module to store and manage this information. By leveraging explicit memory representations, CoM enables GUI agents to better understand task states and retain…
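The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes that short-term memory holds descriptions of the most recent actions and that long-term memory accumulates task-relevant screen information; the class name `ChainOfMemory`, the `max_short_term` window, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChainOfMemory:
    """Hypothetical explicit memory for a GUI agent (illustration only)."""

    max_short_term: int = 3  # assumed window size, not taken from the paper
    short_term: List[str] = field(default_factory=list)  # recent action descriptions
    long_term: List[str] = field(default_factory=list)   # task-relevant screen info

    def record_step(self, action_description: str, screen_info: str = "") -> None:
        """Store one step: the action taken and, optionally, information to retain."""
        self.short_term.append(action_description)
        # Drop the oldest descriptions once the window is exceeded.
        overflow = len(self.short_term) - self.max_short_term
        if overflow > 0:
            del self.short_term[:overflow]
        # Keep screen information for the rest of the task when there is any.
        if screen_info:
            self.long_term.append(screen_info)

    def to_prompt(self) -> str:
        """Render both memories as text that could be prepended to a model prompt."""
        lines = ["Recent actions:"]
        lines += [f"- {a}" for a in self.short_term]
        lines.append("Stored information:")
        lines += [f"- {s}" for s in self.long_term]
        return "\n".join(lines)


if __name__ == "__main__":
    memory = ChainOfMemory(max_short_term=2)
    memory.record_step("Opened the maps app")
    memory.record_step("Searched for a cafe", screen_info="Cafe address: 1 Main St")
    memory.record_step("Switched to the notes app")
    print(memory.to_prompt())
```

In this sketch the address read in the first application is still available after the agent switches applications, even though the action that produced it has already left the short-term window. That is the kind of cross-app state the abstract says implicit screenshot or action histories struggle to preserve.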
