Understanding the Structure of QM7b and QM9 Quantum Mechanical Datasets Using Unsupervised Learning
Julio J. Valdés, Alain B. Tchagang

TL;DR
This study investigates the internal structure of QM7b and QM9 quantum datasets using unsupervised learning techniques, revealing their intrinsic dimensions, clustering patterns, and outlier characteristics relevant for molecular property prediction.
Contribution
It introduces an analysis of the datasets' structure through intrinsic dimension, clustering, and outlier detection, highlighting differences and similarities important for inverse molecular design.
Findings
QM7b data forms well-defined clusters related to atomic composition.
QM9 data has an outer outlier region and an inner clustered core.
Predictability of molecular properties remains high despite structural differences.
Abstract
This paper explores the internal structure of two quantum mechanics datasets (QM7b, QM9), composed of several thousand organic molecules and described in terms of electronic properties. Understanding the structure and characteristics of this kind of data is important when predicting atomic composition from properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. They revealed that, for both datasets, the intrinsic dimensionality is several times smaller than the number of descriptive dimensions. The QM7b data is composed of well-defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers and an inner core region that concentrates clustered, inlier objects. A significant relationship exists between the number of atoms in the molecule and its…
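To make the notion of intrinsic dimension concrete, here is a minimal sketch on synthetic data. It is not the estimator used in the paper, which this summary does not name. It assumes a simple PCA-based criterion (the number of principal components needed to reach a chosen share of the variance), and the function name `pca_intrinsic_dimension`, the 95% threshold, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def pca_intrinsic_dimension(X, variance_threshold=0.95):
    """Return the smallest number of principal components whose
    cumulative explained variance reaches `variance_threshold`.

    This is a linear, global estimate; it is an illustrative stand-in,
    not the method used by the paper's authors.
    """
    # Center the data so the singular values reflect variance.
    Xc = X - X.mean(axis=0)
    # Singular values of the centered data; squares are proportional
    # to the variance captured by each principal component.
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = s**2 / np.sum(s**2)
    cumulative = np.cumsum(explained)
    # First index where the cumulative share meets the threshold,
    # converted to a component count.
    return int(np.searchsorted(cumulative, variance_threshold) + 1)


# Synthetic stand-in for a property table: 500 samples described by
# 12 features that actually vary along only 3 latent directions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
# Orthonormal embedding of the 3 latent directions into 12 dimensions.
basis, _ = np.linalg.qr(rng.normal(size=(12, 3)))
X = latent @ basis.T + 1e-3 * rng.normal(size=(500, 12))

est_dim = pca_intrinsic_dimension(X)
print(f"descriptive dimensions: {X.shape[1]}, estimated intrinsic dimension: {est_dim}")
```

In this toy case the descriptive dimension is 12 while the estimate recovers the 3 latent directions, mirroring the kind of gap the abstract reports. Real molecular property data may lie on a curved manifold, where nonlinear estimators would be a more appropriate choice than this linear one.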
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Protein Structure and Dynamics
