MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows

Xingjian Zhang; Yutong Xie; Jin Huang; Jinge Ma; Zhaoying Pan; Qijia Liu; Ziyang Xiong; Tolga Ergen; Dongsub Shim; Honglak Lee; Qiaozhu Mei

arXiv:2406.06357 · cs.CL · June 11, 2024 · 1 citation

TL;DR

MASSW introduces a large, structured dataset of scientific workflows extracted from publications, enabling new AI tasks to improve understanding and navigation of scientific research processes.

Contribution

The paper presents MASSW, a novel dataset of five workflow aspects automatically extracted from more than 152,000 publications, which supports benchmarking of AI methods for scientific workflow analysis.

Findings

01 Validated LLM-extracted summaries against human annotations

02 Demonstrated multiple downstream tasks using MASSW

03 Showed potential for AI to enhance scientific innovation

Abstract

Scientific innovation relies on detailed workflows, which include critical steps such as analyzing literature, generating ideas, validating these ideas, interpreting results, and inspiring follow-up research. However, scientific publications that document these workflows are extensive and unstructured. This makes it difficult for both human researchers and AI systems to effectively navigate and explore the space of scientific innovation. To address this issue, we introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows. MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years. Using Large Language Models (LLMs), we automatically extract five core aspects from these publications -- context, key idea, method, outcome, and projected impact -- which correspond to five key…
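The five aspects named in the abstract suggest a simple record shape for one paper's summary. The following is a minimal sketch of such a record, not the dataset's actual schema: the class name `WorkflowSummary`, the field names, and the `missing_aspects` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class WorkflowSummary:
    """Hypothetical container for the five aspects listed in the abstract.

    Field names are assumptions; the released dataset may use different keys.
    """

    context: Optional[str] = None
    key_idea: Optional[str] = None
    method: Optional[str] = None
    outcome: Optional[str] = None
    projected_impact: Optional[str] = None

    def missing_aspects(self) -> List[str]:
        """Return the names of aspects that are absent or empty, in field order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    # Placeholder strings only; these are not taken from any real paper.
    summary = WorkflowSummary(context="placeholder context", method="placeholder method")
    print(summary.missing_aspects())
```

Keeping each aspect as an optional field makes it easy to flag papers where an extraction step returned nothing for a given aspect.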

Code & Models

Repositories

xingjian-zhang/massw
Official

Datasets

jimmyzxj/massw
dataset · 18 downloads

Videos

MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows

Taxonomy

Topics: Scientific Computing and Data Management · Research Data Management Practices