The Bearable Lightness of Big Data: Towards Massive Public Datasets in Scientific Machine Learning
Wai Tong Chung, Ki Sung Jung, Jacqueline H. Chen, Matthias Ihme

TL;DR
This paper explores the use of lossy compression to enable sharing massive scientific datasets, demonstrating that deep learning models remain robust despite data fidelity loss, thus facilitating open access to high-fidelity scientific data.
Contribution
It introduces a framework for large-scale scientific datasets using lossy compression, ensuring data utility for machine learning while addressing storage challenges.
Findings
Deep learning models are robust to lossy compression errors.
Lossy compression enables sharing of petascale CFD datasets.
The proposed framework supports community access to high-fidelity scientific data.
Abstract
In general, large datasets enable deep learning models to perform with good accuracy and generalizability. However, massive high-fidelity simulation datasets (from molecular chemistry, astrophysics, computational fluid dynamics (CFD), etc.) can be challenging to curate due to dimensionality and storage constraints. Lossy compression algorithms can help mitigate limitations from storage, as long as the overall data fidelity is preserved. To illustrate this point, we demonstrate that deep learning models, trained and tested on data from a petascale CFD simulation, are robust to errors introduced during lossy compression in a semantic segmentation problem. Our results demonstrate that lossy compression algorithms offer a realistic pathway for exposing high-fidelity scientific data to open-source data repositories for building community datasets. In this paper, we outline, construct, and…
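To make the storage-versus-fidelity trade-off in the abstract concrete, here is a minimal sketch of error-bounded lossy compression in Python. It is not the compressor used in the paper: it assumes a simple scheme (uniform quantization followed by lossless deflate via `zlib`), and the helper names `compress` and `decompress`, the synthetic `field`, and the bound `abs_error` are all hypothetical choices for illustration.

```python
import math
import struct
import zlib


def compress(values, abs_error):
    """Quantize floats into bins of width 2*abs_error, then deflate the bin indices.

    Illustrative scheme only (assumption): real scientific compressors are
    considerably more sophisticated.
    """
    step = 2.0 * abs_error
    codes = [round(v / step) for v in values]        # nearest bin index
    raw = struct.pack(f"<{len(codes)}i", *codes)     # 32-bit little-endian ints
    return zlib.compress(raw, 9)


def decompress(blob, abs_error):
    """Invert compress(): recover bin centers, each within abs_error of the input."""
    step = 2.0 * abs_error
    raw = zlib.decompress(blob)
    codes = struct.unpack(f"<{len(raw) // 4}i", raw)
    return [c * step for c in codes]


# Synthetic smooth 1D signal standing in for a simulation field (assumption).
field = [math.sin(0.01 * i) + 0.5 * math.cos(0.003 * i) for i in range(10000)]
abs_error = 1e-3

blob = compress(field, abs_error)
restored = decompress(blob, abs_error)

original_bytes = 8 * len(field)                      # stored as float64
ratio = original_bytes / len(blob)                   # compression ratio
max_err = max(abs(a - b) for a, b in zip(field, restored))

print(f"compression ratio: {ratio:.1f}x")
print(f"max abs error:     {max_err:.2e} (bound {abs_error:.0e})")
```

Loosening `abs_error` shrinks the compressed size at the cost of fidelity. The abstract's claim is that, for the segmentation task studied, models tolerate errors of this kind.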
Taxonomy
Topics: Scientific Computing and Data Management · Advanced Data Storage Technologies · Research Data Management Practices
