BERTology of Molecular Property Prediction
Mohammad Mostafanejad, Paul Saxe, T. Daniel Crawford

TL;DR
This paper systematically investigates how factors like dataset size and model scale influence chemical language models' performance in molecular property prediction, providing new insights into their effectiveness and underlying mechanisms.
Contribution
The paper offers a comprehensive experimental analysis of the factors affecting chemical language models (CLMs) for molecular property prediction (MPP), addressing inconsistent reported results and gaps in the understanding of their scaling behavior.
Findings
Performance varies significantly with dataset size and model scale.
Standardization impacts CLM effectiveness in MPP tasks.
The study provides numerical evidence and insights into the mechanisms affecting CLM performance.
Abstract
Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the…
Taxonomy
Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Machine Learning in Bioinformatics
