MD-HIT: Machine learning for materials property prediction with dataset redundancy control
Qin Li, Nihang Fu, Sadman Sadeed Omee, Jianjun Hu

TL;DR
This paper introduces MD-HIT, a redundancy reduction algorithm for materials datasets. Removing highly similar samples makes machine learning performance evaluation more reliable, so reported scores better reflect a model's true predictive capability.
Contribution
The paper proposes MD-HIT, a novel redundancy reduction method for materials datasets, addressing overestimated ML performance caused by sample similarity.
Findings
Redundancy reduction improves ML performance assessment accuracy.
MD-HIT effectively reduces dataset similarity in composition and structure.
Performance measured on redundancy-reduced datasets better reflects true model capabilities.
Abstract
Materials datasets are usually featured by the existence of many redundant (highly similar) materials due to the tinkering material design practice over the history of materials research. For example, the Materials Project database has many perovskite cubic structure materials similar to SrTiO3. This sample redundancy within the dataset makes the random splitting of machine learning model evaluation to fail so that the ML models tend to achieve over-estimated predictive performance which is misleading for the materials science community. This issue is well known in the field of bioinformatics for protein function prediction, in which a redundancy reduction procedure (CD-Hit) is always applied to reduce the sample redundancy by ensuring no pair of samples has a sequence similarity greater than a given threshold. This paper surveys the overestimated ML performance in the literature for…
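The threshold rule described in the abstract (no pair of retained samples may be more similar than a given cutoff) can be illustrated with a minimal greedy filter. This is a sketch under stated assumptions, not the MD-HIT implementation: the function name `greedy_redundancy_filter`, the use of Euclidean distance on composition-like vectors, the toy data, and the threshold value are all hypothetical choices made for illustration.

```python
import numpy as np


def greedy_redundancy_filter(features, threshold):
    """Return indices of kept samples such that every pair of kept
    samples is at least `threshold` apart (Euclidean distance).

    Hypothetical helper for illustration; the distance measure and
    greedy order are assumptions, not taken from the paper.
    """
    features = np.asarray(features, dtype=float)
    kept = []
    for i, x in enumerate(features):
        # Keep sample i only if it is far enough from every sample kept so far.
        if all(np.linalg.norm(x - features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy composition-like vectors (made-up data, not from the paper).
toy = [
    [0.20, 0.20, 0.60],
    [0.21, 0.19, 0.60],  # near-duplicate of sample 0
    [0.50, 0.50, 0.00],
    [0.50, 0.49, 0.01],  # near-duplicate of sample 2
    [1.00, 0.00, 0.00],
]
kept = greedy_redundancy_filter(toy, threshold=0.1)
print(kept)
```

With this toy data the two near-duplicates are dropped and the three mutually distant samples remain. A train/test split drawn from the filtered set can no longer place a near-copy of a training sample in the test set, which is the failure mode of random splitting that the abstract describes.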
Taxonomy
Topics: Machine Learning in Materials Science · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
