Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Ruisheng Cao; Fangyu Lei; Haoyuan Wu; Jixuan Chen; Yeqiao Fu; Hongcheng Gao; Xinzhuang Xiong; Hanchong Zhang; Yuchen Mao; Wenjing Hu; Tianbao Xie; Hongshen Xu; Danyang Zhang; Sida Wang; Ruoxi Sun; Pengcheng Yin; Caiming Xiong; Ansong Ni; Qian Liu; Victor Zhong; Lu Chen; Kai Yu; Tao Yu

arXiv:2407.10956 · cs.AI · July 16, 2024 · 2 citations

TL;DR

Spider2-V introduces a comprehensive benchmark to evaluate multimodal agents' ability to automate complex data science workflows, revealing current limitations and guiding future improvements in AI automation tools.

Contribution

This paper presents the first multimodal agent benchmark for enterprise data workflows, including real-world tasks, evaluation metrics, and insights into current agent performance.

Findings

01. Existing agents achieve only 14% success in full workflows.

02. Agents struggle with fine-grained GUI actions, with only 16.2% success.

03. Performance drops further to 10.6% in cloud-based environments.

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI…

Code & Models

Repositories

xlang-ai/spider2-v
Official

Datasets

xlangai/ubuntu_spider2v
dataset · 126 downloads

Videos

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? · SlidesLive

Taxonomy

Topics: Semantic Web and Ontologies