AgentStudio: A Toolkit for Building General Virtual Agents
Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan

TL;DR
AgentStudio introduces a versatile toolkit with environments, tools, and benchmarks to advance the development and evaluation of general virtual agents capable of handling multimodal data and complex actions in open environments.
Contribution
It provides a lightweight, interactive platform with new datasets and benchmarks for assessing fundamental agent capabilities in GUI interactions, video learning, and success detection.
Findings
Established three new datasets: GroundUI, IDMBench, CriticBench.
Created an online task suite for benchmarking GUI and function calling.
Reorganized existing datasets to support agent evaluation.
Abstract
General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which limits agent development and evaluation in real-world settings. As a result, current evaluations lack in-depth analyses that decompose fundamental agent capabilities. We introduce AgentStudio, a trinity of environments, tools, and benchmarks to address these issues. AgentStudio provides a lightweight, interactive environment with highly generic observation and action spaces, e.g., video observations and GUI/API actions. It integrates tools for creating online benchmark tasks, annotating GUI elements, and labeling actions in videos. Based on our environment and tools, we curate an online task suite that benchmarks both GUI interactions and function…
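To make the idea of a generic action space concrete, the following is a minimal sketch of an environment that accepts either a GUI action or an API (function-call) action through one `step` method. It is not AgentStudio's actual interface: the names `GUIAction`, `APIAction`, and `ToyEnv` and their fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class GUIAction:
    """A low-level GUI action such as a click or key press (hypothetical schema)."""
    kind: str  # e.g. "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class APIAction:
    """A function call by name with keyword arguments (hypothetical schema)."""
    name: str
    kwargs: Dict[str, object] = field(default_factory=dict)


Action = Union[GUIAction, APIAction]


class ToyEnv:
    """Toy environment: logs GUI events and dispatches API calls to registered functions."""

    def __init__(self, api: Dict[str, Callable[..., object]]):
        self.api = api
        self.gui_log: List[str] = []

    def step(self, action: Action) -> object:
        # GUI actions are only recorded here; a real environment would drive a desktop.
        if isinstance(action, GUIAction):
            event = f"{action.kind}@({action.x},{action.y}):{action.text}"
            self.gui_log.append(event)
            return event
        # API actions are looked up by name and called with their keyword arguments.
        if isinstance(action, APIAction):
            if action.name not in self.api:
                raise KeyError(f"unknown API function: {action.name}")
            return self.api[action.name](**action.kwargs)
        raise TypeError(f"unsupported action type: {type(action).__name__}")


env = ToyEnv(api={"add": lambda a, b: a + b})
api_result = env.step(APIAction("add", {"a": 2, "b": 3}))
gui_result = env.step(GUIAction("click", x=10, y=20))
```

Routing both action types through one method is one way to let the same agent loop mix GUI and API steps within a single task; the design in the paper itself may differ.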
Peer Reviews
Decision: ICLR 2025 Poster
- This paper provides a lightweight, interactive environment with highly generic observation and action spaces, such as video observations and GUI/API actions, which expand the task space to a massively open domain and real-world tasks. AgentStudio comes with tools for creating and validating benchmark tasks, annotating GUI elements, and labeling actions in videos, which are essential for customizing and validating tasks in real-world settings.
- The toolkit enables online interactions for lea
- Although the authors have made the code available in the supplementary materials, it would be beneficial to offer a more detailed guide to assist users in understanding and implementing the benchmark effectively.
- The paper's claims are somewhat overstated. While AgentStudio's tasks primarily focus on interactions within 2D graphical user interfaces (GUIs), the capabilities of a general virtual agent extend beyond these to include interactions with 3D virtual environments, such as those foun
1. The paper is well written and easy to follow.
2. The dataset curation process makes sense to me.
3. The experiments are adequate. Compared to previous works, AgentStudio has many advantages, including interactivity, support for data/tasks/tools, support for language feedback, etc.
4. AgentStudio shows the shortcomings of existing models. For example, existing models do quite well on single-API tasks but very poorly on compositional tasks.
5. Also the benchmark shows that specialized models c
1. It is unclear whether the number of tasks is sufficient. I would love to see the number of tasks continue to grow.
2. Currently, the software seems to be randomly selected. One potential improvement would be for the authors to gather statistics on the most-used software and include the top applications in the benchmark. For example, I imagine Photoshop would be an interesting case to add to the evaluation suite.
1. The interactive environment design allowing both GUI and API interactions is valuable.
2. It introduces an online task-completion benchmark and three datasets to evaluate fundamental agent abilities in real-world settings. The benchmark suite consists of 205 real-world tasks across various applications such as VS Code, Google Workspace, and Office suites.
3. Evaluating current LLM-based agents (Claude 3.5 Sonnet, GPT-4o, Gemini 1.5, Qwen-VL-Chat) on real-world software interaction tasks, it pr
1. Limited technical novelty: the environment appears to be largely an integration of existing components without significant new technical contributions.
2. The three datasets created for fine-grained evaluation are quite small in scale: IDMBench and CriticBench have only 345 and 350 trajectories, respectively.
3. Insufficient technical details about the implementation: only a cursory mention of using VNC and Docker, and no discussion of performance, latency, scalability, or reliability considerations.
Taxonomy
Topics: Multi-Agent Systems and Negotiation
