Data Readiness for Scientific AI at Scale

Wesley Brewer; Patrick Widener; Valentine Anantharaj; Feiyi Wang; Tom Beck; Arjun Shankar; Sarp Oral

arXiv:2507.23018 · cs.AI · August 1, 2025

TL;DR

This paper introduces a framework for assessing and improving the readiness of scientific datasets for scalable AI training, focusing on high-performance computing environments across multiple scientific domains.

Contribution

The paper proposes a two-dimensional Data Readiness framework, with an accompanying maturity matrix, to guide infrastructure development for scalable, reproducible scientific AI.

Findings

01. Identifies common preprocessing patterns across domains

02. Defines Data Readiness Levels and Processing Stages for HPC environments

03. Provides a conceptual framework for data maturity in scientific AI

Abstract

This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable…
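One way to picture a two-dimensional maturity matrix of this kind is as a table that records a readiness level for each processing stage of a dataset. The Python sketch below is illustrative only and is not the authors' implementation. Only the endpoint names ("raw", "AI-ready", "ingest", "shard") come from the abstract; the intermediate labels, the `ReadinessMatrix` class, the dataset name, and the rule that the least mature stage bounds overall readiness are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Endpoints ("raw", "AI-ready", "ingest", "shard") come from the abstract.
# The intermediate labels are hypothetical placeholders, not the paper's names.
LEVELS = ("raw", "intermediate-a", "intermediate-b", "AI-ready")
STAGES = ("ingest", "intermediate-x", "intermediate-y", "shard")


@dataclass
class ReadinessMatrix:
    """Tracks one readiness level per processing stage for a dataset."""

    dataset: str
    # Every stage starts at the lowest level until it is assessed.
    cells: dict = field(default_factory=lambda: {s: LEVELS[0] for s in STAGES})

    def assess(self, stage: str, level: str) -> None:
        """Record the readiness level reached at a given processing stage."""
        if stage not in STAGES or level not in LEVELS:
            raise ValueError(f"unknown stage or level: {stage!r}, {level!r}")
        self.cells[stage] = level

    def weakest_stage(self) -> str:
        """Return the first stage (in pipeline order) with the lowest level.

        Assumption: the least mature stage bounds overall readiness.
        """
        return min(STAGES, key=lambda s: LEVELS.index(self.cells[s]))

    def is_ai_ready(self) -> bool:
        """True only when every stage has reached the top level."""
        return all(level == LEVELS[-1] for level in self.cells.values())


matrix = ReadinessMatrix("example-climate-archive")  # hypothetical dataset name
matrix.assess("ingest", "AI-ready")
matrix.assess("shard", "intermediate-a")
print(matrix.weakest_stage())
print(matrix.is_ai_ready())
```

Keeping the levels and stages as ordered tuples makes the ordering explicit, so the placeholder labels can be swapped for the paper's actual definitions without changing the logic.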
