OS-MAP: How Far Can Computer-Using Agents Go in Breadth and Depth?
Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, Bowen Zhou

TL;DR
OS-MAP introduces a comprehensive benchmark with 416 tasks across 15 applications, evaluating computer-using agents' automation levels and generalization scope to identify current limitations and guide future development.
Contribution
The paper presents OS-MAP, a novel benchmark that systematically assesses agent capabilities across multiple dimensions, addressing gaps in existing benchmarks and aligning evaluation with real-world user demands.
Findings
State-of-the-art agents struggle with high-level tasks involving perception and reasoning.
Current agents show limited generalization across complex demand hierarchies.
The benchmark reveals significant gaps in agent autonomy and adaptability.
Abstract
Computer-using agents have shown strong potential to boost human productivity and enable new application forms across platforms. While recent advances have led to usable applications, existing benchmarks fail to account for the internal task heterogeneity and the corresponding agent capabilities, as well as their alignment with actual user demands, hindering both targeted capability development and the reliable transition of research progress into practical deployment. To bridge the gap, we present OS-MAP, a benchmark for daily computer-using automation that organizes its 416 realistic tasks across 15 applications along two key dimensions: a five-level taxonomy of automation and a generalization scope derived from a real-world user demand hierarchy. To enable fine-grained analysis of required capabilities and alignment with real-world scenarios, OS-MAP evaluates agents along two…
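The two-dimensional organization described in the abstract can be illustrated with a small sketch that tabulates binary task outcomes by automation level and generalization scope. The record fields, the level and scope labels, and the sample outcomes are hypothetical placeholders chosen for illustration; they are not OS-MAP's actual data format or reported results.

```python
from collections import defaultdict


def success_matrix(results):
    """Return {(level, scope): success_rate} from binary task outcomes."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        key = (record["level"], record["scope"])
        totals[key] += 1
        if record["success"]:
            passed[key] += 1
    # Success rate per cell of the level-by-scope grid.
    return {key: passed[key] / totals[key] for key in totals}


# Hypothetical outcomes for illustration only (not results from the paper).
sample = [
    {"level": "L1", "scope": "scope_a", "success": True},
    {"level": "L1", "scope": "scope_a", "success": False},
    {"level": "L2", "scope": "scope_b", "success": False},
]
matrix = success_matrix(sample)
```

Reporting one rate per cell, rather than a single aggregate, is what lets a reader see where performance drops along each axis.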
Peer Reviews
Decision · Submitted to ICLR 2026
- The core contribution, the two-dimensional evaluation matrix of Automation Level vs. Generalization Scope, is a step forward for agent evaluation. It provides a structured, principled way to move beyond simple aggregate success rates.
- The "Generalization Scope" dimension is based on a well-researched user demand hierarchy, adapting mobile and desktop usage statistics. This demand-driven approach to task curation ensures that the benchmark's relevance is tied to practical utility, a crucial a
- The conflation of task complexity and agent autonomy in the "L-Levels" is somewhat misleading. The framework's "Automation Levels" are inspired by the SAE levels for driving, but there's a crucial difference. In driving, the core task ("go from point A to point B safely") is constant. The levels define the division of labor and the operational design domain. In OS-MAP, the tasks themselves become fundamentally more complex at higher L-levels. L1 involves atomic actions, while L4 involves mult
1. **Unified capability framework.** The paper proposes an original two-axis evaluation scheme (automation level × generalization scope) that provides a structured view of agent competence and offers a clear roadmap for future development.
2. **High-fidelity and reproducible benchmark.** OS-MAP builds on a virtualized desktop environment that spans realistic domains such as office work, study, and system management. The setup ensures controlled reproducibility while remaining extensible to new a
1. **Limited evaluation metrics.** The benchmark currently uses binary task-success scores (0/1), ignoring partial progress, efficiency, or robustness. Incorporating richer quantitative indicators (e.g., partial completion ratio, average steps per success, or recovery rate) would provide a more nuanced evaluation.
2. **High manual cost of task construction.** Although OS-MAP follows a standardized six-stage pipeline, creating and validating new tasks still requires substantial human effort and re
- This paper is extremely well written.
- It provides a very comprehensive benchmark, covering a good range of variety on multiple dimensions.
- The benchmark enables detailed analysis of agent performance and failure cases.
- It conducts a very thorough evaluation of popular models and agents and finds interesting results.
None. I really like this paper
Taxonomy
Topics: Human-Automation Interaction and Safety · Personal Information Management and User Behavior · Social Robot Interaction and HRI
