Unveiling Scaling Behaviors in Molecular Language Models: Effects of Model Size, Data, and Representation
Dong Xu, Qihua Pan, Sisi Yuan, Jianqiang Li, Zexuan Zhu, Junkai Ji

TL;DR
This paper systematically investigates how molecular language models scale with size, data, and representation, revealing predictable scaling laws and the significant influence of molecular representation on performance.
Contribution
The paper provides the first comprehensive analysis of scaling behaviors in molecular language models, backed by a large-scale experimental study and a public release of models and code.
Findings
Molecular language models follow clear scaling laws in both pretraining and transfer tasks
Molecular representation significantly affects model performance
The analysis explains inconsistencies reported in earlier work on molecular generation scaling
Abstract
Molecular generative models, which often apply GPT-style language modeling to molecular string representations, have shown promising capabilities when scaled to large datasets and model sizes. However, it remains debated whether these models follow predictable scaling laws under fixed computational budgets, an understanding that is crucial for allocating resources optimally among model size, data volume, and molecular representation. In this study, we systematically investigate the scaling behavior of molecular language models across both pretraining and downstream tasks. We train 300 models and conduct over 10,000 experiments, rigorously controlling compute budgets while independently varying model size, number of training tokens, and molecular representation. Our results demonstrate clear scaling laws in molecular models for both pretraining and downstream…
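As an illustration of what a "scaling law" means in practice, the sketch below fits a simple power law, loss ≈ a · N^(−b), to loss values measured at several model sizes N. This is a minimal sketch under stated assumptions: the functional form, the function name fit_power_law, and all numbers are hypothetical and chosen for illustration only. They are not the paper's fitting procedure or its results.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) by least squares in log-log space.

    Hypothetical helper for illustration; the paper's actual fitting
    procedure is not described in the summary above.
    """
    # In log space the power law is a straight line:
    # log(loss) = log(a) - b * log(size)
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, made-up data (NOT results from the paper): four model sizes
# and losses generated from a known power law with a = 5.0, b = 0.1.
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 5.0 * sizes ** (-0.1)

a, b = fit_power_law(sizes, losses)
print(f"fitted a = {a:.3f}, fitted b = {b:.3f}")
```

On real measurements the points would be noisy, and an irreducible-loss offset term is often added to the power law. Fitting one curve per molecular representation would be one way to compare how the exponent b changes with the representation.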
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Advanced Graph Neural Networks
