Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Zhiyi Li, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu

TL;DR
This paper introduces large-scale, diverse molecular datasets and a graph learning library to advance foundation models in molecular machine learning, demonstrating improved performance through multi-task training.
Contribution
The work provides seven novel multi-task molecular datasets covering nearly 100 million molecules and over 13 billion labels, along with a new graph learning library and baseline results.
Findings
Performance on low-resource biological datasets improves when models are also trained on quantum data.
The datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset.
Multi-task training shows potential for resource-constrained downstream tasks.
Abstract
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the…
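The abstract describes the tasks as sparsely defined, meaning most molecules carry labels for only some tasks. One common way to train on such data is to mask missing labels out of the loss so that only observed (molecule, task) pairs contribute. The sketch below illustrates that idea in plain Python. It is not taken from the paper or its library: the function name `masked_mse`, the NaN encoding for missing labels, and all numeric values are assumptions made for illustration.

```python
import math


def masked_mse(predictions, labels):
    """Mean squared error over the labels that are present.

    Missing labels are encoded as NaN (an assumption of this sketch)
    and are skipped, so they contribute nothing to the loss.
    """
    total, count = 0.0, 0
    for pred_row, label_row in zip(predictions, labels):
        for pred, label in zip(pred_row, label_row):
            if math.isnan(label):
                continue  # no measurement for this (molecule, task) pair
            total += (pred - label) ** 2
            count += 1
    # Return 0.0 when no label at all is present, to avoid dividing by zero.
    return total / count if count else 0.0


NAN = float("nan")

# Hypothetical toy batch: 3 molecules x 2 tasks, half of the labels missing.
labels = [[1.0, NAN], [NAN, 0.0], [2.0, NAN]]
predictions = [[0.5, 9.0], [9.0, 0.0], [2.0, 9.0]]

loss = masked_mse(predictions, labels)
```

In this toy batch the predictions of 9.0 sit on missing labels and are ignored, so the loss averages over the three observed pairs only.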
