Molecular Quantum Chemical Data Sets and Databases for Machine Learning Potentials

Arif Ullah; Yuxinxin Chen; Pavlo O. Dral

arXiv:2408.12058 · physics.chem-ph · August 13, 2025 · Mach. Learn. Sci. Technol.

TL;DR

This review surveys the quantum chemical data sets and databases used to train machine learning potentials in computational chemistry, covering their characteristics, open challenges, and future needs.

Contribution

The review provides a comprehensive overview of existing quantum chemical data resources and emphasizes standardization, accessibility, and sustainability as priorities for future development.

Findings

1. Key data sets vary in chemical diversity and electronic structure methods used.

2. Challenges include data growth, standardization, and long-term accessibility.

3. Recommendations focus on developing sustainable, interoperable, and user-friendly data platforms.

Abstract

The field of computational chemistry is increasingly leveraging machine learning (ML) potentials to predict molecular properties with high accuracy and efficiency, providing a viable alternative to traditional quantum mechanical (QM) methods, which are often computationally intensive. Central to the success of ML models is the quality and comprehensiveness of the data sets on which they are trained. Quantum chemistry data sets and databases, comprising extensive information on molecular structures, energies, forces, and other properties derived from QM calculations, are crucial for developing robust and generalizable ML potentials. In this review, we provide an overview of the current landscape of quantum chemical data sets and databases. We examine key characteristics and functionalities of prominent resources, including the types of information they store, the level of electronic…

Code & Models

Repositories

arif-phychem/datasets_and_databases_4_mlps (Official)
