TL;DR
OS-SPEAR is a comprehensive toolkit designed to systematically evaluate OS agents across safety, performance, efficiency, and robustness, addressing current benchmark limitations and revealing key insights into agent capabilities.
Contribution
We introduce OS-SPEAR, a multidimensional evaluation framework with specialized subsets and automated diagnostics, enabling rigorous analysis of OS agents' safety, performance, efficiency, and robustness.
Findings
A trade-off is observed between efficiency and safety or robustness.
Specialized agents outperform general-purpose models.
Robustness vulnerabilities vary across modalities.
Abstract
The evolution of Multimodal Large Language Models (MLLMs) has shifted the focus from text generation to active behavioral execution, particularly via OS agents navigating complex GUIs. However, the transition of these agents into trustworthy daily partners is hindered by a lack of rigorous evaluation regarding safety, efficiency, and multi-modal robustness. Current benchmarks suffer from narrow safety scenarios, noisy trajectory labeling, and limited robustness metrics. To bridge this gap, we propose OS-SPEAR, a comprehensive toolkit for the systematic analysis of OS agents across four dimensions: Safety, Performance, Efficiency, and Robustness. OS-SPEAR introduces four specialized subsets: (1) a S(afety)-subset encompassing diverse environment- and human-induced hazards; (2) a P(erformance)-subset curated via trajectory value estimation and stratified sampling; (3) an…
