BERTology of Molecular Property Prediction

Mohammad Mostafanejad; Paul Saxe; T. Daniel Crawford

arXiv:2603.13627·cs.LG·March 17, 2026

TL;DR

This paper systematically investigates how factors like dataset size and model scale influence chemical language models' performance in molecular property prediction, providing new insights into their effectiveness and underlying mechanisms.

Contribution

The paper offers a comprehensive experimental analysis of the factors that affect chemical language models (CLMs) for molecular property prediction (MPP), addressing inconsistencies in reported results and gaps in the understanding of their scaling behavior.

Findings

01 Performance varies significantly with dataset size and model scale.

02 Standardization impacts CLM effectiveness in MPP tasks.

03 The study provides numerical evidence on, and insight into, the mechanisms that affect CLM performance.

Abstract

Chemical language models (CLMs) have emerged as promising competitors to popular classical machine learning models for molecular property prediction (MPP) tasks. However, an increasing number of studies have reported inconsistent and contradictory results for the performance of CLMs across various MPP benchmark tasks. In this study, we conduct and analyze hundreds of meticulously controlled experiments to systematically investigate the effects of various factors, such as dataset size, model size, and standardization, on the pre-training and fine-tuning performance of CLMs for MPP. In the absence of well-established scaling laws for encoder-only masked language models, our aim is to provide comprehensive numerical evidence and a deeper understanding of the underlying mechanisms affecting the performance of CLMs for MPP tasks, some of which appear to be entirely overlooked in the…

Code & Models

Datasets

molssiai-hub/pubchem-04-18-2025
dataset · 41 dl

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods · Machine Learning in Bioinformatics