TL;DR
SMILES-X is an autonomous neural architecture that effectively predicts molecular properties from SMILES strings, especially useful for small datasets, eliminating the need for handcrafted descriptors and providing interpretable results.
Contribution
It introduces a novel Embed-Encode-Attend-Predict neural pipeline with Bayesian hyper-parameter optimization for small dataset molecular property prediction without descriptors.
Findings
Achieves state-of-the-art results in solubility, hydration free energy, and LogD predictions.
Effectively augments data using SMILES de-canonicalization.
Provides interpretable attention-based predictions.
SMILES-X:
autonomous molecular compounds characterisation for small datasets without descriptors
Guillaume Lambard
Research and Services Division of Materials Data and Integrated System, Energy Materials Design Group, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki, 305-0047, Japan.
Corresponding author: [email protected]
Ekaterina Gracheva
International Center for Materials Nanoarchitectonics, National Institute for Materials Science, 1-2-1 Sengen, Tsukuba, Ibaraki, 305-0047 Japan.
University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577 Japan
Abstract
There is growing evidence that machine learning can be successfully applied in materials science and related fields. However, datasets in these fields are often quite small ( samples). As a result, the most advanced machine learning techniques remain neglected, because they are considered applicable to big data only. Moreover, materials informatics methods often rely on human-engineered descriptors that must be carefully chosen, or even created, to fit the physicochemical property that one intends to predict. In this article, we propose a new method that tackles both the issue of small datasets and the difficulty of developing task-specific descriptors. The SMILES-X is an autonomous pipeline for the characterisation of molecular compounds, based on an {Embed-Encode-Attend-Predict} neural architecture with a data-specific Bayesian optimisation of the hyper-parameters. The only input to the architecture, the SMILES strings, is de-canonicalised in order to efficiently augment the data. One of the key features of the architecture is the attention mechanism, which enables the interpretation of output predictions at no extra computational cost. The SMILES-X shows new state-of-the-art results in the inference of aqueous solubility ( mols/L), hydration free energy ( kcal/mol, which is better than molecular dynamics simulations), and octanol/water distribution coefficient ( for LogD at pH 7.4) of molecular compounds. The SMILES-X is intended to become an important asset in the toolkit of materials scientists and chemists. The source code for the SMILES-X is available at github.com/GLambard/SMILES-X.
Keywords: Cheminformatics, Small molecules, SMILES, Descriptors, Natural language processing, Machine learning, Neural architecture, Attention mechanism, Small datasets
1 Introduction
In the fields of bio- and cheminformatics, machine learning (ML) algorithms combined with human-engineered molecular descriptors1, 2 have shown great potential in tasks of predicting physicochemical properties of molecular compounds. In practice, however, it is often necessary to run a blind scan through a large number of such combinations in order to find the most accurate inference model, which still may not lead to success. Most of the descriptors are task- or domain-specific, which makes their use impossible for more general problems, such as virtual screening, similarity searching, clustering and structure-activity modelling3, 4, 5, 6.
Molecular fingerprints have been developed for these purposes. A fingerprint is a binary representation of a molecule: its structural or functional features are translated into a string of bits in such a way that the fingerprint stays invariant to rotations, translations and property-preserving atomic permutations (see, e.g., extended circular fingerprints7). Even though molecular fingerprints are known to be helpful for drug discovery or compound searches across various databases, they may also be detrimental to materials characterisation and design. Therefore, while both descriptors and fingerprints may be beneficial, they come with restrictions.
In fields like materials science it is common to have datasets with samples, which is considered to be too small for a direct deep learning application. Some research groups use neural architectures (NAs) for secondary tasks, such as building novel high-level features as non-linear combinations of molecular descriptors8, 9, 10. Others use NAs to automatically learn features based on 2D/3D images11, 12, molecular graphs13, SMILES (simplified molecular input line entry system)14, 15, 16, N-gram graphs17 or a combination of the mentioned inputs18, similar to computer vision (CV). Still, none of them aims to design an NA for property prediction on small datasets. There are some works on transfer learning19, 20, 11, but the results vary greatly depending on the correlation between the tasks, which is often unknown a priori. Moreover, most of the NAs used in the fields of CV or natural language processing (NLP) are trained on big data and impose architectures that do not fit small datasets.
Aside from the lack of data, another bottleneck in the use of NAs in physics and chemistry is the lack of interpretability. A method for explaining neural networks has been recently proposed15. It consists in training an additional neural network to generate a mask identifying the most important SMILES characters. Despite the respectable coherence in the interpretation of the chemical solubility, the explanation network is entirely correlated to its prediction network, which forces the training phase to be doubled for each dataset. Moreover, even though the explanation network makes it possible to identify the groups having the highest weight in the property prediction, there is no evidence that the original prediction network has also learned the known chemistry concepts needed for a proper characterisation.
In this article we propose a method that overcomes the issues of data scarcity, descriptor engineering and ambiguous interpretation of predictions at the same time. The algorithm benefits from the natural ability of NAs to learn a suitable and task-specific representation of the data. It designs a simple yet effective NA dedicated to small datasets and based on an attention mechanism21, 22, 23. To achieve this, we borrowed the latest techniques from the CV and NLP fields to build an entirely autonomous system, the SMILES-X. To the best of our knowledge, this is the first time in materials science related fields that an NA is specifically designed to manage small datasets, and the first attempt to integrate an NLP-based attention mechanism for predicting physicochemical properties of molecular compounds. This mechanism reduces the number of trainable parameters and provides the interpretation of the results at no extra cost. The SMILES-X achieves state-of-the-art results, predicting any physicochemical property given the molecule’s SMILES24, 25 as the sole input.
The structure of the article is as follows. First, we describe the entire pipeline of the SMILES-X in Section 2. The SMILES augmentation and formatting are detailed in subsections 2.1, 2.2, respectively, while the procedures of building the NA frame and its data-specific optimisation are presented in the subsection 2.3. The subsection 3.1 is dedicated to the performance of the SMILES-X based on three benchmark datasets for regression tasks from the MoleculeNet26: ESOL27, FreeSolv28 and Lipophilicity29. There are three modes of interpretation of the results of the SMILES-X, which are discussed in the subsection 3.2. Finally, we conclude and discuss further possible improvements of the SMILES-X, as well as propose more potential target properties to be inferred using the algorithm in Section 4.
2 The SMILES-X pipeline
The SMILES-X has been conceived to meet the following requirements: (i) The SMILES format is the only representation of a molecular compound; computable characteristics, such as fingerprints or physical descriptors, are left out. (ii) The SMILES canonicalization24 is removed in order to exploit the full capacity of the molecular compound representation. (iii) The core architecture is simple enough to handle small datasets without sacrificing the prediction accuracy. (iv) The outcomes of the SMILES-X are interpretable.
Figure 1 is a sketch of the main steps within the SMILES-X pipeline. The primary input is a list of SMILES strings with corresponding property values. Then, a splitting into training, validation and test sets is performed via equiprobable sampling. The subsequent steps are detailed below.
2.1 Augmentation
It has been shown in CV that data augmentation approaches such as flipping, rotation, scaling, cropping and other image transformations are effective in reducing the error rate on classification tasks and improving generalisation30. Here, we introduce a technique called SMILES augmentation, similar to that of Bjerrum14. The first step consists in removing the canonicalization24 of the SMILES. Canonicalization is the default procedure to standardise the SMILES across databases, therefore removing it expands the number of distinct SMILES representations of each molecule. Then, augmentation is done by iterating over the following two steps: (i) Renumber the atoms of a given SMILES by rotation of their index. (ii) For each renumbering, reconstruct a grammatically correct SMILES under the condition of conserving the initial molecule’s isomerism and prohibiting kekulisation24, 25. In the end, one obtains an expanded list of SMILES together with their corresponding property and cardinality (the number of augmentations of a given SMILES), if any. Duplicated SMILES are removed. The SMILES augmentation is performed separately on the training, validation and test sets, after splitting, to avoid any information leakage. The procedure is performed using the RDKit library31.
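The bookkeeping around the augmentation (split first, enumerate each subset separately, drop duplicates, keep the cardinality) can be sketched in plain Python. The sketch below is an illustration under stated assumptions and not the SMILES-X implementation: `enumerate_smiles` is a hypothetical stand-in that looks up hand-written equivalent SMILES, whereas the actual pipeline generates them with RDKit; `augment_subset` and the property values are likewise illustrative.

```python
# Hand-written, chemically equivalent SMILES used as a stand-in for the
# RDKit-based atom renumbering of the real pipeline (assumption).
EQUIVALENT_SMILES = {
    "CCO": ["CCO", "OCC", "C(O)C"],                          # ethanol
    "Cc1ccccc1": ["Cc1ccccc1", "c1ccccc1C", "c1ccc(C)cc1"],  # toluene
}


def enumerate_smiles(smiles):
    """Hypothetical enumerator: return known equivalent SMILES strings."""
    return EQUIVALENT_SMILES.get(smiles, [smiles])


def augment_subset(smiles_list, properties):
    """Augment one subset (training, validation or test) on its own.

    Returns the expanded SMILES list, the property repeated for every
    variant, and the cardinality (number of variants) of each input SMILES.
    """
    aug_smiles, aug_props, cardinalities = [], [], []
    for smiles, prop in zip(smiles_list, properties):
        # dict.fromkeys removes duplicated SMILES while keeping their order
        variants = list(dict.fromkeys(enumerate_smiles(smiles)))
        aug_smiles.extend(variants)
        aug_props.extend([prop] * len(variants))
        cardinalities.append(len(variants))
    return aug_smiles, aug_props, cardinalities


# The split is done first; each subset is then augmented separately, so no
# variant of a test molecule can end up in the training set.
# The property values below are arbitrary placeholders.
train_smiles, train_props, train_card = augment_subset(["CCO"], [-1.0])
test_smiles, test_props, test_card = augment_subset(["Cc1ccccc1"], [2.0])
```

Keeping the cardinality alongside the expanded list is what later allows the predictions of all variants of one molecule to be grouped again.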
2.2 Tokenisation
Tokenisation consists in dividing the SMILES into unique tokens, each token being a set of characters. The procedure of SMILES tokenisation is as follows24, 25: (i) Aliphatic and aromatic organic atoms (B, C, N, O, S, P, F, Cl, Br, I, b, c, n, o, s, p), bonds, branches and rings (-, =, #, $, /, \, ., (, ), %digits, digit) are set as individual tokens. (ii) The characters between square brackets, which may include inorganic and aromatic organic atoms, isotopes, chirality, hydrogen count, charges or class number, form a single token (brackets included, e.g., [NH4+]). (iii) Unlike in NLP analysis, the beginning token is not different from the termination one: both are represented by a whitespace, which is added at both ends of a tokenised SMILES. This is important to keep its reading direction invariant. Finally, a set of unique tokens is extracted to form the representative chemical vocabulary for a given dataset. To become an input that the NA can interpret, this vocabulary is then mapped into integers, and is kept in memory for future usage.
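The three tokenisation rules can be illustrated with a short regular-expression tokeniser. This is a minimal sketch assuming only the token classes listed above; the names `TOKEN_PATTERN`, `tokenise` and `build_vocabulary` are illustrative and are not taken from the SMILES-X source code.

```python
import re

# Alternatives are tried in order: bracket atoms first, then two-letter
# halogens, then two-digit ring closures, then single-character tokens.
TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]"      # rule (ii): bracketed atom, brackets included
    r"|Br|Cl"           # two-letter aliphatic organic atoms
    r"|%\d{2}"          # two-digit ring-closure label, e.g. %12
    r"|[BCNOSPFI]"      # one-letter aliphatic organic atoms
    r"|[bcnosp]"        # aromatic organic atoms
    r"|[-=#$/\\.()]"    # bonds, dot and branches
    r"|\d)"             # single-digit ring-closure label
)


def tokenise(smiles):
    """Split a SMILES string into tokens, padded with a whitespace token."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported character in SMILES: {smiles!r}")
    # Rule (iii): the same whitespace token marks both ends of the string
    return [" "] + tokens + [" "]


def build_vocabulary(tokenised_smiles):
    """Map every unique token of a dataset to an integer index."""
    unique_tokens = sorted({tok for toks in tokenised_smiles for tok in toks})
    return {tok: idx for idx, tok in enumerate(unique_tokens)}


tokens = tokenise("[NH4+].Cl")
vocabulary = build_vocabulary([tokenise("CCO"), tokens])
encoded = [vocabulary[tok] for tok in tokens]
```

Checking that the joined tokens reproduce the input is a cheap guard against silently dropping characters that fall outside the listed token classes.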
2.3 Architecture search
The neural architecture search has recently reached a new milestone in finding the optimal NA for a given task, by using, e.g., reinforcement learning techniques32, 33 or evolutionary algorithms34. However, these techniques are not only computationally expensive, they also do not necessarily deal with recurrent blocks. It has therefore been decided to fix the overall NA geometry (Figure 2) and to search for the best set of hyperparameters through Bayesian optimisation35. As mentioned earlier in Section 2, this geometry is NLP-oriented and treats SMILES strings as sentences in the chemical language; it has low complexity so as to be applicable to small datasets, and its outcomes are interpretable. Inspired by the hierarchical neural architecture36, which achieves cutting-edge results on document classification, we have built the SMILES-X frame based on a four-step formula: {Embed, Encode, Attend, Predict}37.
1. Embed The embedding layer38 transforms the tokens, encoded as integers from the dataset’s vocabulary, into dense float vectors of fixed dimension. Unlike arbitrary ordinal numbers, these vectors encapsulate the semantic meaning of tokens and their relations. This operation transforms a SMILES into a series of vectors, i.e. a tensor with one row per token of the tokenised SMILES string.
2. Encode The encoding phase is responsible for modifying the embedding, so that it captures the relationships between tokens in the context of the dataset. It consists of two neural layers: a bidirectional CuDNN long short-term memory (LSTM) layer39, 40 followed by a time-distributed fully connected one. The former consists of LSTM blocks and maps the input SMILES, represented now by a tensor, into a context-aware tensor. After training, each row of the tensor represents the meaning of a given token within the context of the rest of the SMILES string containing it. The bidirectionality forces the embedded SMILES to be sequentially passed forwards and backwards, conserving the invariance of their reading direction. The forward and backward encodings of a SMILES are then concatenated, resulting in a single output tensor. The time-distributed dense layer is then applied to each token. This captures the relationships between tokens in greater detail, or in other words deepens the LSTM layer (similar to the effect of adding an extra dense layer to a vanilla neural network). The output after encoding is a tensor with one row per token and one column per hidden unit of this layer. It should be noted that we specifically use CuDNN LSTM41 blocks for efficient optimisation and training phases on NVIDIA GPUs. Without the CuDNN version of LSTM, the speed of training would drop by a factor of , making the optimisation phase intractable.
3. Attend The attention layer detects the salient tokens, compressing the encoded tensor into a vector c with minimum information loss23:
[Equation 1]
The attention layer has two sets of trainable parameters, and it yields the attention vector and the output c. Thus, the attention layer performs two important tasks at once: (1) it collapses the representation of a variable-length chain of tokens into a fixed-length vector c by applying a weighted sum over the tokens to fit the final property best, and (2) the weights in the attention vector represent the importance of each token towards the final property prediction, which leads to a straightforward interpretation. Therefore, the attention layer has two modes, one returning the output vector c, and the other the attention vector (see Section 3). The two modes are switchable at will without extra computational cost.
4. Predict The final NA layer transforms the attention layer output c into a single property value by a simple linear operation:
[Equation 2]
The interpretation given by the attention vector in Equation 1 and the prediction are thus linearly connected and are accessible without any additional treatment of the input data or of the NA, unlike the pipelines in other works42, 43, 15.
It should be noted that all the above tensors or vectors have one additional dimension, omitted for the sake of simplicity. This dimension corresponds to the batch size of a single iteration passed to the network, i.e. the maximum number of SMILES that it processes at once. All of the steps above are implemented with the Keras API44 and TensorFlow45 with GPU support.
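A minimal sketch of the Attend and Predict steps follows, assuming the feed-forward attention form of ref. 23: a tanh score per token, a softmax over the tokens, then a weighted sum. The exact parametrisation used by the SMILES-X may differ, and all names (`attend`, `predict`, `w_att`, `b_att`, `w_out`, `b_out`) are illustrative. As in the text, the batch dimension is omitted.

```python
import math


def attend(encoded, w_att, b_att):
    """Collapse a (tokens x features) list of rows into one feature vector.

    Returns the attention vector (one weight per token, summing to 1) and
    the fixed-length output vector c (one value per feature).
    """
    # One scalar score per token (assumed form: tanh of an affine map)
    scores = [
        math.tanh(sum(h * w for h, w in zip(row, w_att)) + b_att)
        for row in encoded
    ]
    # Softmax over tokens, shifted by the maximum for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted sum of the token rows
    n_features = len(encoded[0])
    c = [sum(a * row[j] for a, row in zip(alpha, encoded))
         for j in range(n_features)]
    return alpha, c


def predict(c, w_out, b_out):
    """Linear read-out of the attention output into one property value."""
    return sum(ci * wi for ci, wi in zip(c, w_out)) + b_out


# Three identical token rows: every token receives the same attention
encoded = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
alpha, c = attend(encoded, w_att=[0.5, -0.5], b_att=0.0)
value = predict(c, w_out=[2.0, 1.0], b_out=0.5)
```

The same call returns both quantities, which mirrors the two modes described above: `alpha` serves the interpretation, `c` feeds the prediction, and the length of `c` does not depend on the number of tokens.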
3 Results & discussion
To evaluate the regression performance of the SMILES-X, we tested it on three benchmark physical chemistry datasets from MoleculeNet26. These datasets are considered small, with fewer than 5000 compound-property pairs, and therefore present a challenge to machine learning models. The ESOL27 dataset contains the logarithmic aqueous solubility (mols/L) for 1128 organic small molecules; FreeSolv28 consists of the calculated and experimental hydration free energies (kcal/mol) for 642 small neutral molecules in water; and Lipophilicity29 stores the experimental data on the octanol/water distribution coefficient (logD at pH 7.4) for 4200 compounds.
In the present report the splitting ratio for training/validation/test is set to 0.8/0.1/0.1. Following the procedure from MoleculeNet26, we performed 8 splits, each time using a new seed for the Monte-Carlo sampling. The seeds have been fixed for the sake of reproducibility. We use the RMSE averaged over the 8 test sets as the comparison metric of performance.
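The evaluation protocol can be sketched as follows. This is an illustration only: the seed values, the rounding of the subset sizes and the function names are assumptions, since the text states only the ratios, the number of splits and the metric.

```python
import math
import random

SEEDS = range(8)  # illustrative seeds, one per split (assumption)


def split_indices(n_samples, seed, ratios=(0.8, 0.1, 0.1)):
    """Shuffle sample indices with a fixed seed and cut them into three parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(ratios[0] * n_samples)
    n_valid = int(ratios[1] * n_samples)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_valid],
        indices[n_train + n_valid:],  # the remainder forms the test set
    )


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mean(values):
    """Arithmetic mean, used to average the test RMSE over the splits."""
    return sum(values) / len(values)


# 642 is the size of the FreeSolv dataset quoted in the text
splits = [split_indices(642, seed) for seed in SEEDS]
```

Fixing the seed of a dedicated `random.Random` instance makes every split reproducible without touching the global random state.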
The optimal model architecture is determined via Bayesian optimisation individually for each split. We used the Python library GPyOpt46 for this purpose. The search bounds are as follows: ( with a step of 0.1, where is related to the optimiser learning rate as , making a total of 50421 configurations. For the Lipophilicity dataset, and learning rate are fixed to 1024 and , respectively, leaving 343 potential architectures to search among. First, 25 architectures are randomly sampled and trained. Then, a maximum of 25 architectures are proposed via the expected improvement acquisition function47. Each architecture is trained in turn for 30 epochs for ESOL and FreeSolv, and for 10 epochs for the Lipophilicity set (these values have been chosen based on the speed/efficiency ratio). The best proposed architecture is finally trained using a standard Adam optimiser48 with checkpoint and early stopping. The early stopping is configured to stop the training if the validation loss has not improved for 50 consecutive epochs, and a checkpoint saves the parameters of the model with the minimal validation loss. The maximum number of epochs is set to 300, but because of the early stopping condition this value has never been reached. Depending on whether the SMILES augmentation is requested or not, the code needs from 1 to 4 GPUs running in parallel.
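The interplay between the checkpoint and the early stopping rule in the final training can be illustrated on a recorded sequence of validation losses. This is a framework-free sketch with a hypothetical function name and made-up loss values; in practice the same behaviour would come from the callbacks of the deep learning library.

```python
def train_with_early_stopping(val_losses, patience=50, max_epochs=300):
    """Replay a sequence of validation losses, one value per epoch.

    Returns (best_epoch, best_loss, last_epoch_run); epochs are 1-based.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # Checkpoint: remember the model with the minimal validation loss
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
    return best_epoch, best_loss, last_epoch


# Synthetic validation curve (made-up numbers): two improving epochs, then a stall
losses = [1.0, 0.5] + [0.6] * 100
best_epoch, best_loss, last_epoch = train_with_early_stopping(losses)
```

In this synthetic run the checkpoint keeps epoch 2, and training stops once 50 further epochs have passed without improvement, well before the 300-epoch cap.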
3.1 Predictions
We compare the performance of the SMILES-X against the best-to-date results from MoleculeNet26 and, for FreeSolv, additionally against the calculations based on molecular dynamics simulations28 (Table 1). The results in MoleculeNet26 are reported for the molecular graph-based models that achieved the best results on a given dataset: concretely, a message passing neural network49 for the ESOL and FreeSolv datasets, and a graph convolutional model50 for the Lipophilicity dataset. Bayesian optimisation is also used there for the layer sizes, batch size and learning rate. We include both the results on canonicalised SMILES (Can) and on SMILES that have been augmented (Augm) (see Section 2.1). When a SMILES string is augmented to several strings, its predicted property value is averaged over the predictions for all of them. Table 1 shows that the SMILES-X reaches the best results for the FreeSolv and Lipophilicity datasets, improving the prediction accuracy by 30% and 9%, respectively, while having a comparable performance on the ESOL data. It is unclear why our algorithm fails to improve on the ESOL data. We thought that the number of tokens per SMILES may be the culprit. However, Figure 3 shows that this is not the case. Note that even using the standard canonicalised SMILES strings, the property can be predicted quite well without employing any chemical knowledge (i.e., using no descriptors). Interestingly, machine learning achieves a better accuracy than the molecular dynamics simulations.
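The averaging over augmented strings amounts to grouping the per-string predictions by molecule. A minimal sketch, assuming the predictions are stored in the same order as the augmented SMILES and that the cardinality of each molecule is known; the function name and the numbers are illustrative.

```python
def average_per_molecule(predictions, cardinalities):
    """Average the predictions of all augmented SMILES of each molecule.

    predictions   -- one value per augmented SMILES, grouped by molecule
    cardinalities -- number of augmented SMILES for each molecule
    """
    if sum(cardinalities) != len(predictions):
        raise ValueError("cardinalities do not match the number of predictions")
    averaged, start = [], 0
    for count in cardinalities:
        chunk = predictions[start:start + count]
        averaged.append(sum(chunk) / count)
        start += count
    return averaged


# Made-up predictions: the first molecule has three variants, the second one
per_molecule = average_per_molecule([1.0, 2.0, 3.0, 10.0], [3, 1])
```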
We see three main reasons why the SMILES-X achieves these results:
- i. The success is mainly attributed to the attention layer, which shows similar improvements in document classification tasks36. Comparing our performance to a similar NA without an attention layer15, we see some 32.5% improvement in accuracy.
- ii. Bayesian optimisation is a valuable tool that efficiently finds the best hyper-parameters in a short time.
- iii. SMILES augmentation brings a clear improvement (Can versus Augm in Table 1), and was necessary to achieve the best current results. Also, one can note that a graph-based NA would not allow such data augmentation.
3.2 Interpretability
As mentioned before, one of the great advantages of our method is its interpretability. Figure 4 shows an example of the trained token embeddings. We used a principal component analysis (PCA51, 52) to reduce the dimensionality of the embeddings down to two, for the purpose of visualisation. The tokens that are not included in the training set, and are therefore randomly assigned, are represented by a cross. One can see that the halogens Br, F, Cl are located near each other. Other distinguishable sets are, for example, and , which have the same valence and bond type within the group. The model also puts close to each other, which reveals their regular coexistence in compounds within the FreeSolv data. The placement of some other tokens, however, is not obvious to qualify chemically. In any case, the principal aim of clustering is to smooth out the chemical relations; it serves as a trainable look-up table for the further context-aware processing of tokens. We should not, thus, expect too great a degree of interpretability at this step. Representation of the individual tokens out of their chemical context is not the objective of the SMILES-X.
Instead, we are interested in the interpretation of the network property prediction. With the SMILES-X, we are able to visualise the importance of each single token towards the final prediction of the property of interest (Figure 5).
Three visualisations are available: (a) a 1D map built from the attention vector (see Equation 1) juxtaposed with the SMILES string, (b) a similar 2D version for the molecular graph and (c) the temporal relative distance to the predicted property. For the first two, the redder and darker the colour, the stronger the attention on a given token.
The temporal relative distance shows the evolution of the prediction for the SMILES while reading it token by token from left to right. It is inspired by Lanchantin53 and defined as:
[Equation 3]
where Prop(n) is the property value predicted from the first n tokens of the SMILES. Note that it converges to the final prediction (the prediction based on the entire SMILES). This also makes it possible to judge how much a token influences the property of a compound. In this example, the prediction based on the fragment ’Cc1ccc(O’ is almost identical to the final prediction on the whole structure.
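The prefix-wise evaluation can be illustrated as follows. This is a sketch under assumptions: the distance is taken here to be the deviation of Prop(n) from the final prediction, normalised by the magnitude of the final prediction, which may differ from the exact definition used by the SMILES-X; `toy_predict` and its per-token contributions are arbitrary placeholders for a trained model.

```python
def temporal_relative_distance(tokens, predict):
    """Relative deviation of each prefix prediction from the final one.

    predict -- callable mapping a list of tokens to a property value
    (assumed form: (Prop(n) - Prop(N)) / |Prop(N)| for n = 1..N)
    """
    props = [predict(tokens[:n]) for n in range(1, len(tokens) + 1)]
    final = props[-1]
    if final == 0:
        raise ValueError("final prediction is zero, relative distance undefined")
    return [(p - final) / abs(final) for p in props]


# Placeholder model: a sum of made-up per-token contributions
CONTRIBUTIONS = {"C": 1.0, "O": -4.0}


def toy_predict(tokens):
    return sum(CONTRIBUTIONS.get(tok, 0.0) for tok in tokens)


distances = temporal_relative_distance(["C", "O", "C"], toy_predict)
```

By construction the last value is zero, and a large jump between two consecutive values points to a token with a strong influence on the prediction.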
For the compound that we used as an example, the oxygen atom (’O’) is considered to be the most influential element of the molecule for the hydration free energy prediction, which reflects chemical reality.
4 Conclusions
A new neural architecture for the characterisation of chemical compounds, the SMILES-X, has been developed. In this article, we have presented the pipeline and performance of the SMILES-X. We demonstrate its aptitude to provide state-of-the-art results on the inference of several physicochemical properties, concretely the logarithmic aqueous solubility ( mols/L), hydration free energy ( kcal/mol) and octanol/water distribution coefficient ( for LogD at pH 7.4). These results show that it is now possible to successfully predict a physicochemical property without employing chemical intuition, even with a small dataset at hand. The success of the SMILES-X rests on three key factors: (i) The Embed-Encode-Attend-Predict architecture, which is simplified by the attention mechanism (i.e., it has fewer trainable parameters), and therefore reduces the risk of over-fitting. (ii) The Bayesian optimisation of the neural network’s hyper-parameters, which achieves a close-to-optimal representation of the molecular compounds, per task and dataset. (iii) The use of SMILES strings as the sole input representation of chemical compounds, which allows efficient data augmentation.
Thanks to the attention mechanism, the SMILES-X comes with three modes of interpretation of the inference outcomes. This provides the end-user with insights into which fragments of the chemical structure have the highest (or the lowest) influence on the property of interest. This kind of artificial intuition is a valuable asset not only for the characterisation and design of novel compounds, but also for re-purposing already-known materials.
As for future improvements of the SMILES-X, we plan to use a BERT-like54 NA skeleton in order to reduce the accuracy gap existing between the ESOL, FreeSolv and Lipophilicity datasets studied here. LSTM blocks are known to have memory problems with very distant dependencies within long sentences, and an architecture that is entirely based on the attention mechanism, i.e. free from LSTM blocks, like BERT, may overcome this weakness. Another way to improve the inference accuracy may be via informative sampling55.
In our forthcoming article we will address the tasks of classification, still using the MoleculeNet’s datasets26. That means that the SMILES-X will be modified in order to handle single-to-many, many-to-many and many-to-single classification tasks.
Conflicts of interest
There are no conflicts to declare.
Acknowledgements
The authors gratefully acknowledge the NVIDIA Corporation for the Titan V and Titan Xp GPUs, without which this research would not have been possible.
References
- 1 Todeschini R and Consonni V (eds) 2008 Handbook of Molecular Descriptors (Weinheim: Wiley-VCH)
- 2 List of available descriptors in RDKit URL https://www.rdkit.org/docs/GettingStartedInPython.html#list-of-available-descriptors
- 3 Willett P, Barnard J M and Downs G M 1998 J. Chem. Inf. Comput. Sci. 38 983–996
- 4 Cereto-Massagué A, Ojeda M J, Valls C, Mulero M, Garcia-Vallvé S and Pujadas G 2015 Methods 71 58–63
- 5 McGregor M J and Pallai P V 1997 J. Chem. Inf. Comput. Sci. 37 443–448
- 6 Li H, Yap C, Ung C, Xue Y, Li Z, Han L, Lin H and Chen Y 2007 J. Pharm. Sci. 96 2838–2860
- 7 Rogers D and Hahn M 2010 J. Chem. Inf. Model. 50 742–754
- 8 Coley C W, Barzilay R, Green W H, Jaakkola T S and Jensen K F 2017 J. Chem. Inf. Model. 57 1757–1772
