On the Importance of Pretraining Data Alignment for Atomic Property Prediction
Yasir Ghunaim, Hasan Abed Al Kader Hammoud, Bernard Ghanem

TL;DR
Pretraining on carefully selected, task-aligned datasets can outperform larger, mixed datasets in atomic property prediction, emphasizing the importance of data quality and alignment over sheer quantity.
Contribution
We introduce the Chemical Similarity Index (CSI) to measure dataset alignment and demonstrate that focused pretraining on aligned data yields superior results with less computational effort.
Findings
Aligned pretraining datasets improve downstream performance.
Adding poorly aligned data can harm model accuracy.
Quality of pretraining data outweighs dataset size.
Abstract
This paper challenges the recent paradigm in atomic property prediction that links progress to growing dataset sizes and computational resources. We show that pretraining on a carefully selected task-aligned dataset can match or even surpass large-scale joint pretraining while using only 1/24th of the pretraining budget. We introduce the Chemical Similarity Index (CSI), a simple metric for molecular graphs inspired by the Fréchet Inception Distance in computer vision, which quantifies the alignment between upstream pretraining datasets and downstream tasks. By selecting the most aligned dataset with minimal CSI distance, we show that models pretrained on a smaller, focused dataset consistently achieve better performance on downstream tasks than those pretrained on massive, mixed datasets such as JMP. This holds even when the mixed dataset includes the upstream dataset most aligned…
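The abstract describes CSI only as a metric inspired by the Fréchet Inception Distance, so the paper's exact definition is not available here. As a rough illustration of the underlying idea, the sketch below computes the standard Fréchet distance between Gaussians fitted to two sets of feature vectors, then picks the upstream dataset with the smallest distance to a downstream task. The function names (`frechet_distance`, `most_aligned`), the dataset labels, and the assumption that each dataset has already been embedded into fixed-length feature vectors are all hypothetical; this is not the authors' implementation.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is a 2-D array with one row per sample and one column per
    feature. How the features are extracted from molecular graphs is an
    assumption left outside this sketch.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def most_aligned(upstream, downstream_feats):
    """Return the upstream dataset name with the smallest distance, plus all distances."""
    distances = {
        name: frechet_distance(feats, downstream_feats)
        for name, feats in upstream.items()
    }
    return min(distances, key=distances.get), distances


if __name__ == "__main__":
    # Synthetic features only; these are not real molecular embeddings.
    rng = np.random.default_rng(0)
    downstream = rng.normal(size=(500, 8))
    upstream = {
        "candidate_a": rng.normal(loc=0.1, size=(500, 8)),
        "candidate_b": rng.normal(loc=2.0, size=(500, 8)),
    }
    best, dists = most_aligned(upstream, downstream)
    print(best, dists)
```

Under this reading, a lower distance means the upstream feature distribution sits closer to the downstream one, which is the sense in which the abstract speaks of selecting the dataset "with minimal CSI distance".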
Taxonomy
Topics: History and advancements in chemistry · Advanced Materials Characterization Techniques · Geochemistry and Geologic Mapping
