How big is Big Data?
Daniel T. Speckhard, Tim Bechtel, Luca M. Ghiringhelli, Martin Kuban, Santiago Rigamonti, and Claudia Draxl

TL;DR
This paper explores the meaning and challenges of big data in materials science machine learning, focusing on data quality, model generalization, and infrastructure needs for large datasets.
Contribution
It provides a comprehensive assessment of what constitutes big data in materials science, highlighting challenges and considerations beyond data volume.
Findings
Data quality and veracity are as important as volume.
Model generalization depends on dataset similarity and feature complexity.
Infrastructure is crucial for handling larger datasets and training models.
Abstract
Big data has ushered in a new wave of predictive power using machine learning models. In this work, we assess what "big" means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity as much as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure requirements are needed to create larger datasets and train models on them. In sum, we find that big data present unique challenges along very different aspects that should serve to motivate further work.
Taxonomy
Topics: Big Data Technologies and Applications
Methods: Sparse Evolutionary Training
