Data Readiness for Scientific AI at Scale
Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral

TL;DR
This paper introduces a framework for assessing and improving the readiness of scientific datasets for scalable AI training, focusing on high-performance computing environments across multiple scientific domains.
Contribution
It proposes a two-dimensional Data Readiness framework and a maturity matrix to guide infrastructure development for scalable, reproducible scientific AI.
Findings
Identifies common preprocessing patterns across domains
Defines Data Readiness Levels and Processing Stages for HPC environments
Provides a conceptual framework for data maturity in scientific AI
Abstract
This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains (climate, nuclear fusion, bio/health, and materials) to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high-performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable…
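The two-dimensional maturity matrix described in the abstract can be pictured as a small data structure. The following is a minimal sketch, not the authors' definition. The abstract names only the endpoints of each axis (raw and AI-ready; ingest and shard), so the `INTERMEDIATE` members are hypothetical placeholders. The rule that the least mature stage bounds a dataset's overall readiness is also an assumption made for illustration.

```python
from enum import IntEnum


class ReadinessLevel(IntEnum):
    RAW = 0           # endpoint named in the abstract
    INTERMEDIATE = 1  # hypothetical placeholder; the abstract does not name middle levels
    AI_READY = 2      # endpoint named in the abstract


class ProcessingStage(IntEnum):
    INGEST = 0        # endpoint named in the abstract
    INTERMEDIATE = 1  # hypothetical placeholder; the abstract does not name middle stages
    SHARD = 2         # endpoint named in the abstract


def overall_readiness(levels_by_stage):
    """Return the lowest readiness level across all processing stages.

    Assumption (not from the paper): a dataset is only as ready as its
    least mature stage, and a stage with no recorded level counts as RAW.
    """
    return min(
        levels_by_stage.get(stage, ReadinessLevel.RAW)
        for stage in ProcessingStage
    )


# Hypothetical dataset: ingest is mature, but sharding has not been addressed.
example = {
    ProcessingStage.INGEST: ReadinessLevel.AI_READY,
    ProcessingStage.INTERMEDIATE: ReadinessLevel.INTERMEDIATE,
    ProcessingStage.SHARD: ReadinessLevel.RAW,
}
print(overall_readiness(example).name)
```

Keeping both axes as ordered enumerations means each cell of the matrix is a (stage, level) pair, and any aggregation rule other than the minimum can be swapped in without changing the representation.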
