Pre-trained Molecular Language Models with Random Functional Group Masking
Tianhao Peng, Yuchen Li, Xuhong Li, Jiang Bian, Zeke Xie, Ning Sui, Shahid Mumtaz, Yanwu Xu, Linghe Kong, Haoyi Xiong

TL;DR
This paper introduces a SMILES-based molecular language model that uses random functional group masking during pre-training to improve structure-aware molecular property prediction, outperforming existing models on multiple benchmarks.
Contribution
The paper proposes a novel SMILES-based pre-training method with functional group masking to enhance structure learning in molecular language models.
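As a rough illustration of what masking functional groups in a SMILES sequence could look like, the Python sketch below tokenizes a SMILES string, locates a few functional groups by literal token-pattern matching, and replaces each matched span with a mask token with some probability. The group list `TOY_GROUPS`, the function names, the `mask_prob` parameter, and the `[MASK]` token are all hypothetical choices made for this example. The summary does not say how the paper identifies functional groups, and literal matching is a crude stand-in: it can confuse related groups such as esters and carboxylic acids, which a cheminformatics toolkit with substructure matching would distinguish.

```python
import random
import re

# Regex commonly used to split a SMILES string into tokens
# (bracket atoms, two-letter halogens, organic-subset atoms, bonds, ring digits).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[()=#+\-\\/:~@?>*$.]|%\d{2}|\d)"
)

# Hypothetical, hand-picked functional groups written as token sequences.
# This list is an assumption for illustration, not the paper's group set.
TOY_GROUPS = {
    "carboxyl": ["C", "(", "=", "O", ")", "O"],
    "nitrile": ["C", "#", "N"],
    "chloro": ["Cl"],
}


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def find_group_spans(tokens, groups=TOY_GROUPS):
    """Return (start, end, name) for every literal occurrence of a group pattern."""
    spans = []
    for name, pattern in groups.items():
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == pattern:
                spans.append((i, i + n, name))
    return spans


def mask_random_groups(smiles, mask_prob=0.5, rng=None, mask_token="[MASK]"):
    """Mask each detected group span independently with probability mask_prob.

    Returns the masked token list and a dict mapping each masked position
    to its original token (the prediction targets for a masked-LM loss).
    """
    rng = rng or random.Random()
    tokens = tokenize(smiles)
    masked = list(tokens)
    labels = {}
    for start, end, _name in find_group_spans(tokens):
        if rng.random() < mask_prob:
            for i in range(start, end):
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


if __name__ == "__main__":
    # 2-cyanopropanoic acid: contains one nitrile and one carboxyl pattern.
    masked, labels = mask_random_groups("CC(C#N)C(=O)O", rng=random.Random(0))
    print(masked)
    print(labels)
```

Masking whole group spans rather than isolated tokens forces the model to reconstruct a chemically meaningful unit from its surrounding context, which is the intuition behind the structure-learning claim above.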
Findings
The proposed model outperforms existing models on 9 of 11 benchmark tasks.
It demonstrates robustness and superior performance across diverse chemical property predictions.
Functional group masking improves the model's ability to infer molecular structures.
Abstract
Recent advancements in computational chemistry have leveraged the power of transformer-based language models, such as MoLFormer, pre-trained using a vast amount of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, 2D or even 3D structures of molecules into pre-training. While most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose a SMILES-based…
Taxonomy
Topics: Topic Modeling · Machine Learning in Bioinformatics
