A Hitchhiker's Guide to Deep Chemical Language Processing for Bioactivity Prediction
Rıza Özçelik, Francesca Grisoni

TL;DR
This paper provides a comprehensive analysis of deep learning methods for chemical language processing in bioactivity prediction, offering practical guidelines and insights into effective model training across various datasets and representations.
Contribution
It systematically evaluates key elements of CLP training, comparing architectures, representations, and strategies to guide researchers in optimizing bioactivity prediction models.
Findings
Certain neural network architectures outperform others in bioactivity tasks.
Molecular representations like SMILES and SELFIES have different impacts on model performance.
Hyperparameter optimization significantly improves predictive accuracy.
Abstract
Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP learns from molecular string representations (e.g., Simplified Molecular Input Line Entry System [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite its growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP training to provide guidelines for newcomers and experts alike. Our study spans three neural network architectures, two string representations, and three embedding strategies across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological choices, but it also equips researchers…
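To make the idea of treating molecules as "chemical language" concrete, the following is a minimal sketch of how a SMILES string might be split into tokens and turned into model inputs. It is not the authors' pipeline: the regular expression, the helper names (`tokenize`, `build_vocab`, `encode`, `one_hot`), and the choice of one-hot vectors are all assumptions made for illustration, and the pattern covers only common SMILES symbols.

```python
import re

# Illustrative SMILES token pattern (an assumption, not taken from the paper):
# bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond/branch/stereo symbols.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]"
)


def tokenize(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown symbols."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize SMILES completely: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, truncating or right-padding to max_len."""
    ids = [vocab[token] for token in tokens][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))


def one_hot(ids, vocab_size):
    """Turn a list of ids into a list of one-hot rows (pure Python)."""
    return [[1 if i == j else 0 for j in range(vocab_size)] for i in ids]


if __name__ == "__main__":
    molecules = ["CC(=O)Cl", "c1ccccc1Br"]  # acetyl chloride, bromobenzene
    token_lists = [tokenize(s) for s in molecules]
    vocab = build_vocab(token_lists)
    ids = encode(token_lists[0], vocab, max_len=10)
    print(token_lists[0])
    print(ids)
    print(len(one_hot(ids, len(vocab))), "rows x", len(vocab), "columns")
```

Note that `Cl` and `Br` are matched before single letters so that a chlorine atom is not split into a carbon and a stray `l`. In practice, the integer ids would typically feed an embedding layer or be expanded to one-hot vectors before reaching the network; which of these works best is the kind of choice the paper sets out to compare.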
Taxonomy
Topics: Computational Drug Discovery Methods
