OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu; Zhenyu Wu; Fangzhi Xu; Yian Wang; Qiushi Sun; Chengyou Jia; Kanzhi Cheng; Zichen Ding; Liheng Chen; Paul Pu Liang; Yu Qiao

arXiv:2410.23218 · cs.CL · October 31, 2024

TL;DR

OS-ATLAS introduces a new open-source GUI action model that significantly improves GUI grounding and out-of-distribution generalization across multiple platforms, supported by a large synthesized dataset and extensive benchmarking.

Contribution

The paper presents OS-ATLAS, a foundational GUI action model with innovative data synthesis and training methods, enabling better GUI understanding and generalization in open-source VLMs.

Findings

01 OS-ATLAS outperforms previous models on six benchmarks.

02 Developed the largest open-source cross-platform GUI grounding dataset.

03 Demonstrated improved GUI understanding and OOD generalization.

Abstract

Existing efforts in building GUI agents heavily rely on the availability of robust commercial Vision-Language Models (VLMs) such as GPT-4o and GeminiProVision. Practitioners are often reluctant to use open-source VLMs because they lag significantly behind their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which…

Taxonomy

Topics: Reinforcement Learning in Robotics · Multi-Agent Systems and Negotiation · Social Robot Interaction and HRI