WebDS: An End-to-End Benchmark for Web-based Data Science
Ethan Hsu, Hong Meng Yam, Ines Bouissou, Aaron Murali John, Raj Thota, Josh Koe, Vivek Sarath Putta, G K Dharesan, Alexander Spangher, Shikhar Murty, Tenghao Huang, Christopher D. Manning

TL;DR
WebDS is a comprehensive benchmark designed to evaluate web-based data science tasks, highlighting current AI limitations and guiding future improvements in real-world data analytics capabilities.
Contribution
It introduces the first end-to-end web-based data science benchmark with diverse, complex tasks across multiple websites, revealing significant gaps in current AI agent performance.
Findings
Current SOTA agents perform poorly on WebDS tasks.
Humans achieve around 90% accuracy, far above AI agents.
WebDS exposes new failure modes in AI tools.
Abstract
Many real-world data science tasks involve complex web-based interactions: finding appropriate data available on the internet, synthesizing multimodal data from different locations, and producing summarized analyses. Existing web benchmarks often focus on simplistic interactions and often do not require diverse tool-using capabilities. Conversely, traditional data science benchmarks typically concentrate on static, highly structured datasets and do not assess end-to-end workflows that encompass data acquisition, cleaning, analysis, and insight generation. In response, we introduce WebDS, the first end-to-end web-based data science benchmark. It comprises 870 web-based data science tasks across 29 diverse websites from structured government data portals to unstructured news media, challenging agents to perform complex, multi-step, tool-based operations, across heterogeneous data formats,…
Peer Reviews
Decision·ICLR 2026 Poster
1. The benchmark's grounding in practitioner interviews and the inclusion of 29 diverse, data-rich websites ensure the tasks are highly realistic and demand complex generalization, not simple pattern matching.
2. The failure analysis (Table 5) provides excellent, fine-grained insights into unique failure modes not captured by simplistic metrics. Specifically, "Failed Repetition" (due to lacking state-checking heuristics) and "Poor Groundedness" (contradiction between perceived and latent knowl
1. The analysis and outputs primarily focus on text (reports, answers). Since the input sites are described as being rich in graphics and non-textual data, the benchmark's current focus may under-represent the full multimodal synthesis challenge (e.g., generating a visual chart or interpreting a complex image-based trend).
2. The action space is not fixed, allowing researchers flexibility, but the implementation relies on existing WebArena/BrowserGym abstractions. A brief discussion on whether
- The benchmark environment is very comprehensive, covering 29 websites, 10 domains, and 870 tasks. It focuses on the **data science domain**, evaluating the **entire data processing pipeline**.
- The test environment is **dockerized**, offering fixed environments, stable experiments, and reproducible results.
- The authors use **vision-language models** to analyze complete execution trajectories, providing **richer evaluation metrics** beyond success rate.
- **Lack of innovation:** The benchmark design does not significantly differ from existing mature web agent benchmarks. The main contributions remain in expanding the testing environment and task set.
- **Insufficient rigor:** Although using vision-language models for trajectory evaluation is common in GUI agent research, the validation experiments for scoring accuracy and stability (on a 1–5 scale) are rather cursory. The authors only mention comparing 50 human-evaluated tasks, yet the claimed
This work fills a major gap in current evaluation: no existing benchmark assesses full end-to-end data science workflows involving both web interaction and analytical reasoning. It captures realistic web-based tasks that better reflect real-world data analysis behavior. The authors evaluate a wide range of SOTA models consistently and highlight key performance bottlenecks.
1. The subjective scoring relies on GPT-4o, creating potential evaluation circularity and bias toward similar model families. Including more human evaluations or open-source LLM judges would strengthen reliability.
2. While qualitative error categories are given, there is no quantitative breakdown of which task attributes (multi-hop, tool-use, multi-site) most contribute to failures.
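The quantitative breakdown requested in the last point could be computed along the following lines. This is only a minimal sketch, not the authors' method: the record layout, the `failure_rate_by_attribute` helper, and every task outcome below are hypothetical values made up for illustration, not WebDS results.

```python
from collections import defaultdict

# Hypothetical per-task records. The attribute tags and outcomes are
# invented for illustration and are NOT taken from the WebDS paper.
tasks = [
    {"id": 1, "attributes": {"multi-hop", "tool-use"}, "success": False},
    {"id": 2, "attributes": {"multi-site"}, "success": True},
    {"id": 3, "attributes": {"multi-hop", "multi-site"}, "success": False},
    {"id": 4, "attributes": {"tool-use"}, "success": True},
    {"id": 5, "attributes": set(), "success": True},
]


def failure_rate_by_attribute(records):
    """Return the fraction of failed tasks among tasks carrying each attribute."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        for attribute in record["attributes"]:
            totals[attribute] += 1
            if not record["success"]:
                failures[attribute] += 1
    return {attribute: failures[attribute] / totals[attribute] for attribute in totals}


rates = failure_rate_by_attribute(tasks)
for attribute, rate in sorted(rates.items()):
    print(f"{attribute}: {rate:.0%} failure rate")
```

Because a task can carry several attributes, these per-attribute rates are marginal and overlap; separating their individual contributions would require something like a regression over attribute indicators, which this sketch does not attempt.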
Taxonomy
Topics: Scientific Computing and Data Management · Data Quality and Management · Data Visualization and Analytics
