macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Pei Yang, Hai Ci, and Mike Zheng Shou

arXiv:2506.04135 · cs.AI · October 21, 2025

TL;DR

macOSWorld is the first comprehensive multilingual benchmark for GUI agents on macOS, evaluating their performance across diverse tasks, languages, and safety vulnerabilities, revealing significant gaps in current models.

Contribution

Introduces macOSWorld, a novel benchmark with 202 multilingual GUI tasks on macOS, including safety evaluation, to advance GUI agent research in a new OS domain.

Findings

1. Proprietary computer-use agents lead, with success rates above 30%.

2. Open-source models lag, with success rates below 5%.

3. Multilingual performance drops, most notably in Arabic, with a 28.8% degradation.

Abstract

Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only and cover web-use or Windows, Linux, and Android environments, not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead…

Code & Models

Repositories

showlab/macosworld (PyTorch, official)

Videos

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents · SlidesLive

Taxonomy

Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities