How big is Big Data?

Daniel T. Speckhard; Tim Bechtel; Luca M. Ghiringhelli; Martin Kuban; Santiago Rigamonti; and Claudia Draxl

arXiv:2405.11404·stat.ML·May 21, 2024

Open Access

TL;DR

This paper explores the meaning and challenges of big data in materials science machine learning, focusing on data quality, model generalization, and infrastructure needs for large datasets.

Contribution

It provides a comprehensive assessment of what constitutes big data in materials science, highlighting challenges and considerations beyond data volume.

Findings

01 Data quality and veracity are as important as volume.

02 Model generalization depends on dataset similarity and feature complexity.

03 Infrastructure is crucial for handling larger datasets and training models.

Abstract

Big data has ushered in a new wave of predictive power using machine learning models. In this work, we assess what "big" means in the context of typical materials-science machine-learning problems. This concerns not only data volume, but also data quality and veracity, as well as infrastructure issues. With selected examples, we ask (i) how models generalize to similar datasets, (ii) how high-quality datasets can be gathered from heterogeneous sources, (iii) how the feature set and complexity of a model can affect expressivity, and (iv) what infrastructure is needed to create larger datasets and train models on them. In sum, we find that big data presents unique challenges along very different aspects, which should serve to motivate further work.

Taxonomy

Topics: Big Data Technologies and Applications

Methods: Sparse Evolutionary Training