MD-HIT: Machine learning for materials property prediction with dataset redundancy control

Qin Li; Nihang Fu; Sadman Sadeed Omee; Jianjun Hu

arXiv:2307.04351 · cond-mat.mtrl-sci · July 11, 2023



TL;DR

This paper introduces MD-HIT, a redundancy reduction algorithm for materials datasets. By removing highly similar samples, it makes machine learning performance evaluation more reliable, so that reported results better reflect a model's true predictive capability.

Contribution

The paper proposes MD-HIT, a novel redundancy reduction method for materials datasets, addressing overestimated ML performance caused by sample similarity.

Findings

1. Redundancy reduction improves the accuracy of ML performance assessment.
2. MD-HIT effectively reduces dataset similarity in composition and structure.
3. Performance measured on redundancy-reduced datasets better reflects true model capabilities.

Abstract

Materials datasets typically contain many redundant (highly similar) materials, a consequence of the tinkering approach to material design throughout the history of materials research. For example, the Materials Project database contains many cubic perovskite materials similar to SrTiO$_3$. This sample redundancy causes random splitting to fail as a strategy for evaluating machine learning models: the models tend to achieve overestimated predictive performance, which is misleading for the materials science community. The issue is well known in bioinformatics for protein function prediction, where a redundancy reduction procedure (CD-HIT) is routinely applied so that no pair of samples has a sequence similarity greater than a given threshold. This paper surveys the overestimated ML performance in the literature for…
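The thresholded filtering idea described in the abstract can be sketched as a greedy loop. This is a minimal illustration under stated assumptions, not the MD-HIT implementation: the function names, the Manhattan distance over element fractions, the toy compositions, and the 0.5 threshold are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical representation: element symbol -> atomic fraction.
Composition = Dict[str, float]


def composition_distance(a: Composition, b: Composition) -> float:
    """Manhattan distance between two element-fraction dictionaries.

    Illustrative choice only; any pairwise distance could be swapped in.
    """
    elements = set(a) | set(b)
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in elements)


def greedy_redundancy_filter(
    names: List[str],
    compositions: Dict[str, Composition],
    distance: Callable[[Composition, Composition], float],
    threshold: float,
) -> List[str]:
    """Keep a sample only if it is at least `threshold` away from every
    sample already kept, so no retained pair is closer than the threshold."""
    kept: List[str] = []
    for name in names:
        candidate = compositions[name]
        if all(distance(candidate, compositions[k]) >= threshold for k in kept):
            kept.append(name)
    return kept


# Toy data (assumed for illustration): three perovskites and one rock salt.
toy = {
    "SrTiO3": {"Sr": 0.2, "Ti": 0.2, "O": 0.6},
    "BaTiO3": {"Ba": 0.2, "Ti": 0.2, "O": 0.6},
    "CaTiO3": {"Ca": 0.2, "Ti": 0.2, "O": 0.6},
    "NaCl": {"Na": 0.5, "Cl": 0.5},
}

kept = greedy_redundancy_filter(list(toy), toy, composition_distance, threshold=0.5)
print(kept)
```

In this toy run the BaTiO3 and CaTiO3 entries sit within the threshold of SrTiO3 and are dropped, while NaCl is retained. The result of a greedy pass depends on the order in which samples are visited, which is one reason a real tool would need to specify how candidates are sorted.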


Code & Models

Repositories

usccolumbia/md-hit (Official)


Taxonomy

Topics: Machine Learning in Materials Science · Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies
