OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

Zheng Wu; Yi Hua; Zhaoyuan Huang; Chenhao Xue; Yijie Lu; Pengzhou Cheng; Zongru Wu; Lingzhong Dong; Gongshen Liu; Xinghao Jiang; Zhuosheng Zhang

arXiv:2604.24348 · cs.CL · April 28, 2026

TL;DR

OS-SPEAR is a comprehensive toolkit designed to systematically evaluate OS agents across safety, performance, efficiency, and robustness, addressing current benchmark limitations and revealing key insights into agent capabilities.

Contribution

We introduce OS-SPEAR, a multidimensional evaluation framework with specialized subsets and automated diagnostics, enabling rigorous analysis of OS agents' safety, performance, efficiency, and robustness.

Findings

01. Trade-off observed between efficiency and safety or robustness.

02. Specialized agents outperform general-purpose models.

03. Robustness vulnerabilities vary across modalities.

Abstract

The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an…
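The abstract says the P(erformance)-subset is curated via trajectory value estimation and stratified sampling, but the excerpt here does not specify the estimator or the strata. As a rough illustration of the general idea only, the sketch below assumes each trajectory already has a scalar value score, buckets trajectories into equal-width value bins, and draws up to a fixed number from each bin. The function name `stratified_sample`, the equal-width binning, and the parameters `n_bins`, `per_bin`, and `seed` are all hypothetical and are not taken from the paper or its repository.

```python
import random
from collections import defaultdict


def stratified_sample(trajectories, values, n_bins, per_bin, seed=0):
    """Draw up to `per_bin` trajectories from each equal-width value bin.

    Hypothetical helper: the binning scheme and parameters are assumptions,
    not the procedure used by OS-SPEAR.
    """
    if len(trajectories) != len(values):
        raise ValueError("trajectories and values must have the same length")
    if not trajectories:
        return []

    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 when all values are identical.
    width = (hi - lo) / n_bins or 1.0

    # Assign each trajectory to a bin; the maximum value is clamped
    # into the last bin.
    bins = defaultdict(list)
    for traj, value in zip(trajectories, values):
        idx = min(int((value - lo) / width), n_bins - 1)
        bins[idx].append(traj)

    # Sample without replacement inside each non-empty bin.
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


if __name__ == "__main__":
    trajs = [f"t{i}" for i in range(10)]
    scores = [float(i) for i in range(10)]
    print(stratified_sample(trajs, scores, n_bins=3, per_bin=2, seed=1))
```

Sampling per bin rather than from the whole pool keeps low-, mid-, and high-value trajectories represented even when one range dominates the raw data, which is the usual motivation for stratification.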

Code & Models

Repositories

Wuzheng02/OS-SPEAR (GitHub)
