Uncovering Neural Scaling Laws in Molecular Representation Learning
Dingshuo Chen, Yanqiao Zhu, Jieyu Zhang, Yuanqi Du, Zhixun Li, Qiang Liu, Shu Wu, Liang Wang

TL;DR
This paper investigates how data quantity and quality influence molecular representation learning, revealing power-law scaling laws and evaluating data pruning strategies to enhance learning efficiency in drug discovery tasks.
Contribution
It provides the first comprehensive analysis of neural scaling laws in molecular representation learning from a data-centric perspective, including empirical validation and benchmarking of data pruning methods.
Findings
Power-law relationship between data volume and performance
Data pruning strategies can challenge existing scaling laws
Insights into improving learning efficiency in MRL
Abstract
Molecular Representation Learning (MRL) has emerged as a powerful tool for drug and materials discovery in a variety of tasks such as virtual screening and inverse design. While there has been a surge of interest in advancing model-centric techniques, the influence of both data quantity and quality on molecular representations is not yet clearly understood within this field. In this paper, we delve into the neural scaling behaviors of MRL from a data-centric viewpoint, examining four key dimensions: (1) data modalities, (2) dataset splitting, (3) the role of pre-training, and (4) model capacity. Our empirical studies confirm a consistent power-law relationship between data volume and MRL performance across these dimensions. Additionally, through detailed analysis, we identify potential avenues for improving learning efficiency. To challenge these scaling laws, we adapt seven popular…
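The power-law relationship mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common functional form error ≈ a · N^(-b), where N is the number of training molecules; this form, the function name `fit_power_law`, and the synthetic numbers are illustrative assumptions, not the paper's fitting procedure or its results.

```python
import numpy as np

def fit_power_law(n_samples, errors):
    """Fit errors ≈ a * n_samples**(-b) by least squares in log-log space.

    Hypothetical helper for illustration; the paper's exact fitting
    procedure is not described in this summary.
    """
    log_n = np.log(np.asarray(n_samples, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # A power law is a straight line in log-log space:
    # log(error) = log(a) - b * log(N)
    slope, intercept = np.polyfit(log_n, log_e, deg=1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic, noise-free numbers for illustration only (not results from the paper).
sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000, 32_000], dtype=float)
true_a, true_b = 5.0, 0.3
errors = true_a * sizes ** (-true_b)

a_hat, b_hat = fit_power_law(sizes, errors)
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

On real measurements the fit is noisy, so the exponent b is usually reported together with the range of dataset sizes over which the straight-line behavior in log-log space holds.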
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Metabolomics and Mass Spectrometry Studies
Methods: Pruning
