Building informative materials datasets beyond targeted objectives
Rafael Espinosa Castañeda, Ashley Dale, Hongchen Wang, Yonatan Kurniawan, Hao Wan, Runze Zhang, Adji Bousso Dieng, Kangming Li, Jason Hattrick-Simpers

TL;DR
This paper introduces a diversity-aware framework for constructing materials datasets that maximizes informativeness for specific properties while maintaining broad coverage for future learning, improving prediction performance and reducing bias.
Contribution
The authors propose a novel diversity-aware selection method for dataset construction that enhances informativeness across multiple properties and mitigates cold-start issues in materials discovery.
Findings
Prediction performance on untargeted properties improved by up to 10% with the framework.
Performance on targeted properties can be increased by up to 25% using the proposed method.
Without diversity-aware selection, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling; diversity-aware dataset construction avoids this degradation.
Abstract
Materials science data collection can be expensive, making the reuse and long-term utility of datasets critically important for future discovery campaigns. In practice, researchers prioritize a subset of properties according to their research interests. However, ignoring a subset of outcomes in data collection campaigns can produce datasets that are poorly suited to future learning tasks. Here, we present a framework for dataset construction that maximizes informativeness for target properties of interest while preserving performance on untargeted ones. Our approach uses diversity-aware selection to ensure broad coverage of the materials space. In noisy experimental dataset construction, we find that without our diversity-aware framework, prediction performance on untargeted properties can degrade by up to 40% relative to random sampling, whereas applying our framework yields improvements of up to…
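
The abstract does not say which diversity criterion the framework uses, so the snippet below is only an illustration of the general idea of diversity-aware selection, not the authors' method. It assumes each candidate material is described by a numeric feature vector and uses greedy farthest-point (max-min) sampling, a common generic heuristic for spreading a selection across a space. The function name `greedy_maxmin_select`, the `seed_index` parameter, and the toy data are all hypothetical.

```python
import math

def greedy_maxmin_select(features, k, seed_index=0):
    """Pick k indices by greedy farthest-point sampling (illustrative only).

    features: list of equal-length numeric vectors, one per candidate.
    k: number of candidates to select.
    seed_index: index of the first selected candidate (an assumption here).
    """
    n = len(features)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of candidates")
    selected = [seed_index]
    # Distance from each candidate to its nearest already-selected point.
    min_dist = [math.dist(f, features[seed_index]) for f in features]
    while len(selected) < k:
        # Choose the candidate that is farthest from the current selection.
        nxt = max(range(n), key=min_dist.__getitem__)
        selected.append(nxt)
        # Update nearest-selected distances with the newly added point.
        min_dist = [min(d, math.dist(f, features[nxt]))
                    for d, f in zip(min_dist, features)]
    return selected

# Toy 2-D descriptors: two near-duplicates, one distant point, one in between.
candidates = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0)]
print(greedy_maxmin_select(candidates, k=3))  # [0, 2, 3]
```

In this toy case the near-duplicate at index 1 is skipped in favor of points that cover more of the space, which is the kind of broad coverage the abstract describes. A real workflow would combine such a diversity term with an informativeness score for the targeted properties.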
