macOSWorld: A Multilingual Interactive Benchmark for GUI Agents
Pei Yang, Hai Ci, and Mike Zheng Shou

TL;DR
macOSWorld is the first comprehensive multilingual benchmark for GUI agents on macOS. It evaluates agents across diverse tasks, languages, and safety vulnerabilities, and reveals significant gaps in current models.
Contribution
Introduces macOSWorld, a benchmark of 202 multilingual interactive GUI tasks on macOS with a dedicated safety benchmarking subset, extending GUI agent research to a new OS domain.
Findings
Proprietary computer-use agents lead, with success rates above 30%.
Open-source models lag, with success rates below 5%.
Multilingual performance drops most notably in Arabic, with a 28.8% degradation.
Abstract
Graphical User Interface (GUI) agents show promising capabilities for automating computer-use tasks and facilitating accessibility, but existing interactive benchmarks are mostly English-only, covering web-use or Windows, Linux, and Android environments, but not macOS. macOS is a major OS with distinctive GUI patterns and exclusive applications. To bridge the gaps, we present macOSWorld, the first comprehensive benchmark for evaluating GUI agents on macOS. macOSWorld features 202 multilingual interactive tasks across 30 applications (28 macOS-exclusive), with task instructions and OS interfaces offered in 5 languages (English, Chinese, Arabic, Japanese, and Russian). As GUI agents are shown to be vulnerable to deception attacks, macOSWorld also includes a dedicated safety benchmarking subset. Our evaluation on six GUI agents reveals a dramatic gap: proprietary computer-use agents lead…
Taxonomy
Topics: Security and Verification in Computing · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities
