CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents
Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li

TL;DR
Crab is a novel benchmark framework for evaluating multimodal language model agents across multiple environments, featuring a graph-based evaluation and supporting diverse device platforms.
Contribution
Introduces Crab, the first cross-environment agent benchmark with a fine-grained evaluation method and flexible task construction, enabling comprehensive MLM agent assessment.
Findings
A single agent with GPT-4o achieves a 38.01% completion ratio.
Crab supports 120 tasks across desktop and mobile environments.
Framework and datasets are publicly available.
Abstract
The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language with GUI environments, such as websites, desktop computers, or mobile phones. Existing benchmarks for MLM agents in interactive environments are limited by their focus on a single environment, lack of detailed and generalized evaluation methods, and the complexities of constructing tasks and evaluators. To overcome these limitations, we introduce Crab, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging Crab, we developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in…
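To make the graph-based, fine-grained evaluation idea concrete, the following is a minimal sketch and not the Crab implementation. It assumes a task is decomposed into checkpoint nodes in a directed acyclic graph, that a node can only be credited once all of its parent nodes are complete, and that the completion ratio is the fraction of completed nodes. The names `Checkpoint`, `GraphEvaluator`, and the state keys (`message_opened`, `editor_open`, `file_text`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical snapshot of the combined environment state (e.g. phone + desktop).
State = Dict[str, object]


@dataclass
class Checkpoint:
    """One node of the task graph: a named check over the environment state."""
    name: str
    check: Callable[[State], bool]
    parents: List[str] = field(default_factory=list)
    done: bool = False


class GraphEvaluator:
    """Tracks which checkpoints have been reached as the agent acts."""

    def __init__(self, checkpoints: List[Checkpoint]) -> None:
        self.nodes = {c.name: c for c in checkpoints}

    def update(self, state: State) -> None:
        # Re-scan until nothing new completes, so a chain of checkpoints
        # already satisfied by this state is credited in a single update.
        progressed = True
        while progressed:
            progressed = False
            for node in self.nodes.values():
                if node.done:
                    continue
                parents_done = all(self.nodes[p].done for p in node.parents)
                if parents_done and node.check(state):
                    node.done = True
                    progressed = True

    def completion_ratio(self) -> float:
        # Assumed definition: completed nodes divided by total nodes.
        return sum(n.done for n in self.nodes.values()) / len(self.nodes)


# Illustrative cross-environment task: read a message on the phone,
# open an editor on the desktop, then write the message into a file.
evaluator = GraphEvaluator([
    Checkpoint("read_message", lambda s: s.get("message_opened") is True),
    Checkpoint("open_editor", lambda s: s.get("editor_open") is True),
    Checkpoint(
        "write_file",
        lambda s: s.get("file_text") == "hello",
        parents=["read_message", "open_editor"],
    ),
])

# The agent finishes the first two sub-goals but never writes the file.
for state in (
    {"message_opened": True},
    {"message_opened": True, "editor_open": True},
):
    evaluator.update(state)

ratio = evaluator.completion_ratio()
print(f"completion ratio: {ratio:.2f}")
```

In this sketch the agent receives partial credit for the two intermediate checkpoints even though the final goal is not met, which is the kind of intermediate progress a binary success metric would miss.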
Peer Reviews
Decision·Submitted to ICLR 2025
Originality & Quality: CRAB is the first benchmark to incorporate cross-environment tasks, reflecting real-world scenarios. Novel graph evaluator and sub-task composition methods address the limitations of existing evaluation methods and benchmarks. Clarity: Contains detailed descriptions of the framework design, task dataset, and agent implementations. Visual case studies provide examples to illustrate agent performance on specific tasks. Good analysis of results by platform, model, agent structure, an
- Limited coverage of applications. It focuses on original apps in Ubuntu & Android on Pixel devices. Expanding to more apps and devices would further improve real-world coverage. The number of data instances (120 tasks) is relatively small. While the tasks are designed to be more complex than those in other benchmarks, extending the dataset to around 500 instances would enable more comprehensive and statistically significant comparisons between different models and agent settings. A larger data
A key contribution of the paper is the cross-platform evaluation capability, allowing MLMs to be assessed across desktop and mobile platforms. The graph-based task decomposition enables flexible, multi-path task completion, and the detailed evaluation metrics offer a comprehensive assessment of agent performance. The diverse experimental setup with various MLM models shows CRAB’s broad applicability.
1. The paper lacks sufficient review of existing work, missing comparisons with related frameworks such as Mobile-ENV [1] and GUI-World [2]. 2. Table 1 only presents the number of apps, whereas similar papers like AndroidWorld [3] provide detailed counts of tasks and templates per app. Additionally, "osworld" appears twice in the table, which could cause confusion and detract from a thorough task-level comparison. 3. The explanation of GDT (Graph of Decomposed Tasks) is unclear; it appears more
1. CRAB's ability to handle tasks across different environments is an advancement, reflecting the multi-platform nature of real-world applications. 2. The graph evaluator provides a nuanced and detailed assessment of agent performance, capturing intermediate progress and multiple valid pathways to task completion. 3. The framework's design allows for easy adaptation to new platforms and devices, and its sub-task composition method streamlines task creation. 4. The inclusion of 120 real-world ta
The following comment reflects my personal bias; if the authors can convince me, I would be happy to change my score. Although web agents have recently become a hot topic, I feel that this is an engineering problem, not a research problem. In other words, as long as we gather enough web interaction data, the problems listed in the papers in the related work will no longer exist. Therefore, just like 10 years ago, we spend quite a lot of effort in tweaking different models and benchmarks to
Taxonomy
Topics: Natural Language Processing Techniques · Multi-Agent Systems and Negotiation · Speech and dialogue systems
