Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study
Pei-Hung Lin, Chunhua Liao, Winson Chen, Tristan Vanderbruggen, Murali Emani, Hailu Xu

TL;DR
This paper presents a methodology to enhance the FAIRness of HPC datasets and machine learning models, significantly improving their findability, accessibility, interoperability, and reusability, demonstrated through a case study.
Contribution
It introduces a comprehensive, quantitative methodology for assessing and improving the FAIRness of HPC datasets and models, filling a gap in current practices.
Findings
FAIRness improved from 19.1% to 83.0% after applying the methodology.
The methodology includes assessment, suggestions, and validation on a representative dataset.
The case study demonstrates that the methodology effectively improves the FAIRness of both the dataset and the model.
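To make the idea of a quantitative FAIRness assessment concrete, here is a minimal Python sketch. It assumes the score is simply the percentage of pass/fail checks that succeed; the summary above does not say how the paper computes its percentages, so this scoring rule, the `Check` type, the function names, and the example check labels are all hypothetical illustrations rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    principle: str  # one of "F", "A", "I", "R"
    name: str       # hypothetical check label, not taken from the paper
    passed: bool


def fairness_score(checks):
    """Return the percentage of passed checks, rounded to one decimal.

    Assumption: every check carries equal weight.
    """
    if not checks:
        raise ValueError("at least one check is required")
    passed = sum(1 for c in checks if c.passed)
    return round(100.0 * passed / len(checks), 1)


def failed_by_principle(checks):
    """Group failed check names by principle, as input for suggestions."""
    failed = {}
    for c in checks:
        if not c.passed:
            failed.setdefault(c.principle, []).append(c.name)
    return failed


# Illustrative input only; these are not results from the paper.
checks = [
    Check("F", "persistent identifier", False),
    Check("F", "rich metadata", False),
    Check("A", "retrievable via standard protocol", True),
    Check("R", "license stated", False),
]

print(fairness_score(checks))       # 25.0
print(failed_by_principle(checks))
```

Re-running the same checks after adding, for example, a persistent identifier and a license would give the "after" score, which mirrors the assess, suggest, and validate loop described in the findings.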
Abstract
The FAIR Guiding Principles aim to improve the findability, accessibility, interoperability, and reusability of digital content by making it both human and machine actionable. However, these principles have not yet been broadly adopted in the domain of machine learning-based program analyses and optimizations for High-Performance Computing (HPC). In this paper, we design a methodology to make HPC datasets and machine learning models FAIR after investigating existing FAIRness assessment and improvement techniques. Our methodology includes a comprehensive, quantitative assessment for selected data, followed by concrete, actionable suggestions to improve FAIRness with respect to common issues related to persistent identifiers, rich metadata descriptions, license and provenance information. Moreover, we select a representative training dataset to evaluate our methodology. The experiment…
Taxonomy
Topics: Scientific Computing and Data Management · Research Data Management Practices · Advanced Data Storage Technologies
