Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen

TL;DR
Spider2-V introduces a comprehensive benchmark to evaluate multimodal agents' ability to automate complex data science workflows, revealing current limitations and guiding future improvements in AI automation tools.
Contribution
This paper presents the first multimodal agent benchmark for enterprise data workflows, including real-world tasks, evaluation metrics, and insights into current agent performance.
Findings
Existing agents succeed on only 14% of full workflows.
On tasks that require fine-grained GUI actions, the success rate is only 16.2%.
On tasks in cloud-based environments, the success rate falls to 10.6%.
Abstract
Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI…
Taxonomy
Topics: Semantic Web and Ontologies
