OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, Yu Qiao

TL;DR
OS-ATLAS introduces a new open-source GUI action model that significantly improves GUI grounding and out-of-distribution generalization across multiple platforms, supported by a large synthesized dataset and extensive benchmarking.
Contribution
The paper presents OS-ATLAS, a foundational GUI action model with innovative data synthesis and training methods, enabling better GUI understanding and generalization in open-source VLMs.
Findings
- OS-ATLAS outperforms previous models on six benchmarks.
- Developed the largest open-source cross-platform GUI grounding dataset.
- Demonstrated improved GUI understanding and OOD generalization.
Abstract
Existing efforts to build GUI agents rely heavily on robust commercial Vision-Language Models (VLMs) such as GPT-4o and Gemini Pro Vision. Practitioners are often reluctant to use open-source VLMs because they lag significantly behind their closed-source counterparts, particularly in GUI grounding and Out-Of-Distribution (OOD) scenarios. To facilitate future research in this area, we developed OS-Atlas, a foundational GUI action model that excels at GUI grounding and OOD agentic tasks through innovations in both data and modeling. We have invested significant engineering effort in developing an open-source toolkit for synthesizing GUI grounding data across multiple platforms, including Windows, Linux, macOS, Android, and the web. Leveraging this toolkit, we are releasing the largest open-source cross-platform GUI grounding corpus to date, which…
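GUI grounding, as used in the abstract, means locating the on-screen element that a natural-language instruction refers to. The following is a minimal Python sketch of how a grounding prediction might be turned into a click and scored; it is not OS-Atlas's actual interface. It assumes a hypothetical model output string of the form `(x1,y1),(x2,y2)` with coordinates normalized to a 0 to 1000 range, converts the box center to pixel coordinates, and counts the prediction as correct if the click lands inside the target element's pixel box. The names `box_to_click` and `click_hits_target` are illustrative only.

```python
import re
from typing import Optional, Tuple

# Assumption (hypothetical format): the model emits a box as "(x1,y1),(x2,y2)"
# with coordinates normalized to 0..1000. Not confirmed by the source.
BOX_PATTERN = re.compile(r"\((\d+),\s*(\d+)\),\s*\((\d+),\s*(\d+)\)")


def box_to_click(text: str, width: int, height: int,
                 scale: int = 1000) -> Optional[Tuple[int, int]]:
    """Parse a normalized box and return the pixel coordinates of its center."""
    match = BOX_PATTERN.search(text)
    if match is None:
        return None  # no box found in the model output
    x1, y1, x2, y2 = (int(v) for v in match.groups())
    # Take the box center, then rescale from the normalized range to pixels.
    cx = (x1 + x2) / 2 / scale * width
    cy = (y1 + y2) / 2 / scale * height
    return round(cx), round(cy)


def click_hits_target(click: Tuple[int, int],
                      target_box: Tuple[int, int, int, int]) -> bool:
    """Return True if the click falls inside (left, top, right, bottom)."""
    x, y = click
    left, top, right, bottom = target_box
    return left <= x <= right and top <= y <= bottom


if __name__ == "__main__":
    click = box_to_click("(100,200),(300,400)", width=1920, height=1080)
    print(click)  # (384, 324)
    print(click_hits_target(click, (350, 300, 420, 350)))  # True
```

Scoring by "click inside the target box" rather than by box overlap reflects that a GUI agent ultimately acts with a single click point; the coordinate convention above is an assumption and should be checked against the released model cards.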
Code & Models
- 🤗 ByteDance-Seed/UI-TARS-1.5-7B · model · 142k dl · ♡ 533
- 🤗 OS-Copilot/OS-Atlas-Base-7B · model · 952 dl · ♡ 42
- 🤗 OS-Copilot/OS-Atlas-Base-4B · model · 314 dl · ♡ 10
- 🤗 OS-Copilot/OS-Atlas-Pro-7B · model · 43 dl · ♡ 28
- 🤗 OS-Copilot/OS-Atlas-Pro-4B · model · 20 dl · ♡ 3
- 🤗 Mungert/UI-TARS-1.5-7B-GGUF · model · 1.7k dl · ♡ 13
- 🤗 what2up/UI-TARS-1.5-7B · model · 17 dl · ♡ 1
- 🤗 flin775/UI-TARS-1.5-7B-AWQ · model · 755 dl
- 🤗 vocaela/Vocaela-500M · model · 40 dl · ♡ 3
Taxonomy
Topics: Reinforcement Learning in Robotics · Multi-Agent Systems and Negotiation · Social Robot Interaction and HRI
