LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases
Martin Siron, Inel Djafar, Ali Ramlaoui, Etienne du Fayette, Amandine Rossello, Edvin Fako, Matthew McDermott, Felix Therrien, Luis Barroso-Luque, Flaviu Cipcigan, Philippe Schwaller, Thomas Wolf, Alexandre Duval

TL;DR
LeMat-Bulk is a large, standardized, and de-duplicated materials database that improves data integration and analysis for materials science using a novel hashing algorithm.
Contribution
We introduce LeMat-Bulk, a unified materials database with a new hashing method for effective de-duplication and standardization across diverse datasets.
Findings
LeMat-Bulk contains over 5.3 million materials entries.
The BAWL hashing algorithm outperforms existing fingerprinting techniques in robustness.
Our methodology enhances data interoperability and analysis of functional-dependent trends.
Abstract
The rapid expansion of materials science databases has driven machine learning-based discovery while also posing challenges in data integration, duplication, and interoperability. Robust standardization and de-duplication methods are needed to address these issues and streamline materials research. We present LeMat-Bulk, a unified dataset combining Materials Project, OQMD, and Alexandria, encompassing over 5.3 million PBE-calculated materials and also representing the largest collection of PBEsol and SCAN functional calculations. Our methodology standardizes calculations across databases that use different parameters, addressing redundancy and improving cross-compatibility. To de-duplicate, we propose a hashing function that we term the Bonding Algorithm Weisfeiler-Lehman (BAWL). We comprehensively benchmark this fingerprint under atomic noise, lattice strain, and…
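To make the de-duplication idea concrete, the following is a minimal sketch of hashing a bonding graph with Weisfeiler-Lehman label refinement and grouping entries that share a hash. It is not the authors' BAWL implementation: the graph construction, the function names (`wl_hash`, `deduplicate`), the iteration count, and the toy entries are all assumptions made for illustration, and a real fingerprint would have to derive the bonds from the crystal structure and is likely to include further information.

```python
import hashlib
from collections import defaultdict


def _digest(text: str) -> str:
    """Short, deterministic hex digest of a string."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=8).hexdigest()


def wl_hash(elements, bonds, iterations=3):
    """Weisfeiler-Lehman hash of a labelled, undirected graph (illustrative).

    elements: list of element symbols, one per site (initial node labels).
    bonds: iterable of (i, j) site-index pairs (assumed already computed).
    """
    neighbours = defaultdict(list)
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    labels = list(elements)
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbour labels.
        labels = [
            _digest(labels[i] + "|" + ",".join(sorted(labels[j] for j in neighbours[i])))
            for i in range(len(labels))
        ]
    # Only the multiset of final labels is hashed, so the result does not
    # depend on how the sites are numbered.
    return _digest(",".join(sorted(labels)))


def deduplicate(entries):
    """Group entry ids by graph hash; each group holds candidate duplicates."""
    groups = defaultdict(list)
    for entry in entries:
        groups[wl_hash(entry["elements"], entry["bonds"])].append(entry["id"])
    return dict(groups)


# Hypothetical entries: "a" and "b" are the same graph with sites renumbered,
# while "c" has the same composition but different connectivity.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
entries = [
    {"id": "a", "elements": ["Na", "Cl", "Na", "Cl"], "bonds": ring},
    {"id": "b", "elements": ["Cl", "Na", "Cl", "Na"], "bonds": ring},
    {"id": "c", "elements": ["Na", "Na", "Cl", "Cl"], "bonds": ring},
]
groups = deduplicate(entries)
```

In this toy case, "a" and "b" fall into one group and "c" into another. Because the hash is built from connectivity and element labels rather than raw coordinates, small atomic displacements that leave the bonding graph unchanged would not alter it, which is the kind of robustness the benchmark described in the abstract is meant to probe.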
Taxonomy
Topics: Machine Learning in Materials Science · Nanocluster Synthesis and Applications · Advanced Materials Characterization Techniques
