On the redundancy in large material datasets: efficient and robust learning with less data
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers

TL;DR
This paper demonstrates that large material datasets contain significant redundancy: up to 95% of the data can be removed from training with little impact on in-distribution prediction performance. It highlights the importance of data quality over quantity.
Contribution
It provides evidence of redundancy in large material datasets and shows that active learning can create smaller, more informative datasets for robust predictions.
Findings
Up to 95% of data can be removed with minimal impact on in-distribution performance.
Redundant data mainly involves over-represented material types.
Active learning effectively constructs smaller, informative datasets.
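
The last finding can be illustrated with a minimal sketch of uncertainty-based selection. This is not the authors' implementation: the bootstrap ensemble of ridge regressors, the function names (`fit_ridge`, `ensemble_predict`, `active_learning_select`), and all parameter values are assumptions chosen for illustration. The idea shown is only the general loop: train on a small labeled subset, estimate predictive uncertainty on the remaining pool, and add the most uncertain sample.

```python
import numpy as np

def fit_ridge(X, y, alpha=1e-3):
    # Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def ensemble_predict(X_train, y_train, X_query, n_models=10, rng=None):
    # Hypothetical uncertainty estimate: spread of predictions across
    # models fitted on bootstrap resamples of the training subset.
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_query @ w)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

def active_learning_select(X_pool, y_pool, n_init=5, n_select=20, seed=0):
    # Start from a random subset, then repeatedly add the pool sample
    # on which the ensemble disagrees the most.
    rng = np.random.default_rng(seed)
    selected = [int(i) for i in rng.choice(len(X_pool), size=n_init, replace=False)]
    while len(selected) < n_select:
        remaining = np.setdiff1d(np.arange(len(X_pool)), selected)
        _, std = ensemble_predict(
            X_pool[selected], y_pool[selected], X_pool[remaining], rng=rng
        )
        selected.append(int(remaining[np.argmax(std)]))
    return selected
```

Calling `active_learning_select(X, y, n_select=20)` on a feature matrix `X` and target vector `y` returns the indices of a 20-sample training subset. In practice the ridge ensemble would be swapped for whatever model and uncertainty estimate suits the property being predicted; the selection loop itself stays the same.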
Abstract
Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95 % of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the "bigger is better"…
Taxonomy
Topics: Machine Learning in Materials Science · Machine Learning and Algorithms · Machine Learning and Data Classification
