Deep learning for molecular design - a review of the state of the art

Daniel C. Elton; Zois Boukouvalas; Mark D. Fuge; Peter W. Chung

arXiv:1903.04388·cs.LG·May 24, 2019

Deep learning for molecular design - a review of the state of the art

Daniel C. Elton, Zois Boukouvalas, Mark D. Fuge, Peter W. Chung

PDF

TL;DR

This review surveys recent deep learning methods for molecular design, highlighting diverse models, representations, and key themes like reward functions and benchmarking, to advance autonomous molecule generation.

Contribution

It provides a comprehensive overview of the latest deep generative models for molecules, comparing techniques and identifying critical themes and challenges in the field.

Findings

01

Shift from SMILES to graph and 3D representations

02

Importance of reward function design in model performance

03

Need for standardized benchmarking protocols

Abstract

In the space of only a few years, deep generative modeling has revolutionized how we think of artificial creativity, yielding autonomous systems which produce original images, music, and text. Inspired by these successes, researchers are now applying deep generative modeling techniques to the generation and optimization of molecules - in our review we found 45 papers on the subject published in the past two years. These works point to a future where such systems will be used to generate lead molecules, greatly reducing resources spent downstream synthesizing and characterizing bad leads in the lab. In this review we survey the increasingly complex landscape of models and representation schemes that have been proposed. The four classes of techniques we describe are recursive neural networks, autoencoders, generative adversarial networks, and reinforcement learning. After first discussing…

Tables3

Table 1. Table 1: Different representation schemes

	method	unique?	invertible?
3D	raw voxels	✗	✓
	smoothed voxels	✗	✓
	tensor field networks	✗	✗
2D graph	SMILES	✗	✓
	canonical SMILES	✓	✓
	InChI	✓	✓
	MACCS keys	✓	✗
	tensors	✗	✓
	Chemception images	✓	✓
	fingerprinting	✓	✗

Table 2. Table 2: For works that trained models separately on multiple datasets we report only the largest dataset used. Several of these datasets are described in table 3 , which lists the major publicly available datasets. Other datasets are “HCEP”, the Harvard Clean Energy Project dataset of lead molecules for organic photovoltaic, “PSC”, a dataset of monomer repeat units for polymer solar cells, “MCF-7”, a database of anti-cancer molecules, and “L1000”, a database of molecules and gene expression profiles. Acronyms used are: AAE = adversarial autoencoder, ANC = adversarial neural computer, ATNC = adversarial threshold neural computer, BMI = Bayesian model inversion, CAAE = constrained AAE, CCM-AAE = Constant-curvature Riemannian manifold AAE, CFG = context free grammar, CVAE = constrained VAE, ECC = edge-conditioned graph convolutions, 70 GAN = generative adversarial network, GCPN = graph convolutional policy network, GVAE = grammar VAE, JT-VAE = junction tree VAE, MHG = molecular hypergraph grammar, MPNN = Message Passing Neural Net, RG = reduced graph, RNN = recurrent neural network, sGAN = stacked GAN, SD-VAE = syntax-directed VAE, SSVAE = semi-supervised VAE, VAE = variational autoencoder ∗ filtered to isolate likely bioactive compounds.

architecture	representation	$N_{train}$	dataset(s)	citation(s)
RNN	SMILES	1,611,889	ZINC	Bjerrum, 2017 ⁷¹
RNN	SMILES	541,555	ChEMBL^∗	Gupta, 2017 ⁷²
RNN	SMILES	350,419	DRD2	Oliverona, 2017 ⁷³
RNN	SMILES	1,400,000	ChEMBL	Segler, 2017 ⁷⁴
RNN	SMILES	250,000	ZINC	Yang, 2017 ⁷⁵
RNN	SMILES	200,000	ZINC	Cherti, 2017 ⁷⁶
RNN	SMILES	1,735,442	ChEMBL	Neil, 2018 ⁷⁷
RNN	SMILES	1,500,000	ChEMBL	Popova, 2018 ⁷⁸
RNN	SMILES	13,000	PubChemQC	Sumita, 2018 ⁷⁹
RNN	SMILES	541,555	ChEMBL^∗	Merk, 2018 ⁸⁰
RNN	SMILES	541,555	ChEMBL^∗	Merk, 2018 ⁸¹
RNN	SMILES	509,000	ChEMBL	Ertl, 2018 ⁸²
RNN	SMILES	1,000,000	GDB-13	Arús-Pous, 2018 ⁸³
RNN	SMILES	163,000	ZINC	Zheng, 2019 ⁸⁴
RNN	RG+SMILES	798,243	ChEMBL	Pogány, 2018 ⁸⁵
RNN	graph operations	130,830	ChEMBL	Li, 2018 ⁵⁸
VAE	SMILES	249,000	ZINC/QM9	Gómez-Bombarelli, 2016 ²³
VAE	SMILES	1,200,000	ChEMBL	Blaschke, 2018 ⁸⁶
VAE	SMILES	500,000	ZINC	Lim, 2018 ⁸⁷
VAE	SMILES	300,000	ZINC	Kang, 2018 ⁸⁸
VAE	SMILES	190,000	ZINC	Harel, 2018 ⁸⁹
VAE	SMILES	1,211,352	ChEMBL23	Sattarov, 2019 ⁹⁰
GVAE	CFG (SMILES)	200,000	ZINC	Kusner, 2017 ⁹¹
GVAE	CFG (custom)	3,989	PSC	Jørgensen, 2018 ^{26, 92}
SD-VAE	CFG (custom)	250,000	ZINC	Dai, 2018 ⁵⁷
JT-VAE	graph operations	250,000	ZINC	Jin, 2018 ⁹³
JT-VAE	graph operations	250,000	ZINC	Jin, 2019 ⁹⁴
CVAE	graph	250,000	ZINC/CEPDB	Liu, 2018 ⁹⁵
MHG-VAE	graph (MHG)	220,011	ZINC	Kajino, 2018 ⁹⁶
VAE	graph	72,000,000	ZINC+PubChem	Winter, 2018 ⁹⁷
VAE	graph	10,000	ZINC/QM9	Samanta, 2018 ⁹⁸
VAE	graph (tensors)	10,000	ZINC	Samanta, 2018 ⁹⁹
VAE	graph (tensors)	250,000	ZINC/QM9	Simonovsky, 2018 ⁶³
CVAE	graph (tensors)	250,000	ZINC	Ma, 2018 ¹⁰⁰
VAE	3D wave transform	4,8000,000	ZINC	Kuzminkykh, 2018 ³⁹
CVAE	3D density	192,813,983	ZINC	Skalic, 2019 ⁴⁰
VAE+RL	MPNN+graph ops	133,885	QM9	Kearns, 2019¹⁰¹
GAN	SMILES	5,000	GBD-17	Guimaraes, 2017 ¹⁰²
GAN (ANC)	SMILES	15,000	ZINC/CHEMDIV	Putin, 2018¹⁰³
GAN (ATNC)	SMILES	15,000	ZINC/CHEMDIV	Putin, 2018¹⁰⁴
GAN	graph (tensors)	133,885	QM9	De Cao, 2018^{62, 49}
GAN	MACCS (166bit)	6,252	MCF-7	Kadurin, 2017 ⁶⁹
sGAN	MACCS (166bit)	20,000	L1000	Méndez-Lucio, 2017 ¹⁰⁵
CycleGAN	graph operations	250,000	ZINC	Maziarka, 2019 ¹⁰⁶
AAE	MACCS (166bit)	6,252	MCF-7	Kadurin, 2017 ⁶⁸
AAE	SMILES	15,000	HCEP	Sanchez-Lengeling, 2017 ¹⁰⁷
CCM-AAE	graph (tensors)	133,885	QM9	Grattarola, 2018 ¹⁰⁸
BMI	SMILES	16,674	PubChem	Ikebata, 2017 ¹⁰⁹
CAAE	SMILES	1,800,000	ZINC	Polykovskiy, 2018 ¹¹⁰
GCPN	graph	250,000	ZINC	You, 2018 ⁶⁴
pure RL	graph	n/a	n/a	Zhou, 2018 ⁶⁵
pure RL	fragments	n/a	n/a	Ståhl, 2019 ¹¹¹

Table 3. Table 3: Some publicly available datasets. ∗ also contains numerous conformers for each molecule, for a total of 4,855 structures

dataset	description	$N$	URL / citation
GDB-13	Combinatorially generated library.	977,468,314	http://gdb.unibe.ch/downloads/¹¹²
ZINC15	Commercially available compounds.	$>$ 750,000,000	http://zinc15.docking.org¹¹³
GDB-17	Combinatorially generated library.	50,000,000	http://gdb.unibe.ch/downloads/¹¹⁴
eMolecules	Commercially available compounds.	18,000,000	https://reaxys.emolecules.com/
SureChEMBL	Compounds obtained from chemical patents.	17,000,000	https://www.surechembl.org/search/
PubChemQC	Compounds from PubChem with property data from quantum chemistry (DFT) calculations.	3,981,230	http://pubchemqc.riken.jp/¹¹⁵
ChEMBL	A curated database of bioactive molecules.	2,000,000	https://www.ebi.ac.uk/chembl/
SuperNatural	A curated database of natural products.	2,000,000	http://bioinformatics.charite.de/supernatural/
QM9	Stable small CHONHF organic molecules taken from GDB-17 with properties calculated from ab initio density functional theory.	133,885	http://quantum-machine.org/datasets/¹¹⁶
BNPAH	B, N-substituted polycyclic aromatic hydrocarbons with properties calculated from ab initio density functional theory.	33,000	https://moldis.tifrh.res.in/datasets.html¹¹⁷
DrugBank	FDA drugs and other drugs available internationally.	10,500	https://www.drugbank.ca/
Energetics	Energetic molecules and simulation data collected from public domain literature.	417	https://git.io/energeticmols²⁷
HOPV15	Harvard Organic Photovoltaic Dataset	350^∗	https://figshare.com/articles /HOPV15_Dataset/1610063/4¹¹⁸

Equations76

L^{\mbox M L E} = - s \in Z \sum t = 2 \sum T lo g π_{θ} (s_{t} ∣ S_{1 : t - 1})

L^{\mbox M L E} = - s \in Z \sum t = 2 \sum T lo g π_{θ} (s_{t} ∣ S_{1 : t - 1})

p_{i}^{\mbox n e w} = \frac{exp ( \frac{p _{i}}{T} )}{\sum _{i} exp ( \frac{p _{i}}{T} )}

p_{i}^{\mbox n e w} = \frac{exp ( \frac{p _{i}}{T} )}{\sum _{i} exp ( \frac{p _{i}}{T} )}

p_{i}^{\mbox n e w} = \frac{p _{i}^{\frac{1}{T}}}{\sum _{i} p _{i}^{\frac{1}{T}}}

p_{i}^{\mbox n e w} = \frac{p _{i}^{\frac{1}{T}}}{\sum _{i} p _{i}^{\frac{1}{T}}}

θ max J (θ) = E [G_{T} ∣ s_{0}, θ]

θ max J (θ) = E [G_{T} ∣ s_{0}, θ]

J (θ) = R_{T} t = 0 \sum T π_{θ} (a_{t} ∣ s_{t})

J (θ) = R_{T} t = 0 \sum T π_{θ} (a_{t} ∣ s_{t})

π_{θ} (a_{t} ∣ s_{t}) = π_{θ} (s_{t + 1} ∣ S_{1 : t})

π_{θ} (a_{t} ∣ s_{t}) = π_{θ} (s_{t + 1} ∣ S_{1 : t})

\nabla J (θ) = E [G_{t} \frac{\nabla _{θ} π _{θ} ( a _{t} ∣ y _{1 : t - 1} )}{π _{θ} ( a _{t} ∣ y _{1 : t - 1} )}]

\nabla J (θ) = E [G_{t} \frac{\nabla _{θ} π _{θ} ( a _{t} ∣ y _{1 : t - 1} )}{π _{θ} ( a _{t} ∣ y _{1 : t - 1} )}]

R^{'} (S) = [σ R (S) + lo g P_{\mbox p r i or} (S) - lo g P_{\mbox c u r r e n t} (S)]^{2}

R^{'} (S) = [σ R (S) + lo g P_{\mbox p r i or} (S) - lo g P_{\mbox c u r r e n t} (S)]^{2}

lo g p (x) \geq E_{z \sim q_{ϕ} (z ∣ x)} [lo g \frac{p _{θ} ( x ∣ z ) p ( z )}{q _{ϕ} ( z ∣ x )}]

lo g p (x) \geq E_{z \sim q_{ϕ} (z ∣ x)} [lo g \frac{p _{θ} ( x ∣ z ) p ( z )}{q _{ϕ} ( z ∣ x )}]

L_{θ, ϕ} (x) = E_{z \sim q_{ϕ} (z ∣ x)} [lo g p_{θ} (x)] - D_{\mbox K L} (q_{ϕ} (z ∣ x), p_{θ} (z ∣ x))

L_{θ, ϕ} (x) = E_{z \sim q_{ϕ} (z ∣ x)} [lo g p_{θ} (x)] - D_{\mbox K L} (q_{ϕ} (z ∣ x), p_{θ} (z ∣ x))

D_{\mbox K L} (q (z), p (z)) \equiv \int q (z) lo g \frac{q ( z )}{p ( z )} d z

D_{\mbox K L} (q (z), p (z)) \equiv \int q (z) lo g \frac{q ( z )}{p ( z )} d z

L_{θ, ϕ} (x)

L_{θ, ϕ} (x)

= E_{z \sim q_{ϕ} (z ∣ x)} [lo g p_{θ} (x ∣ z)] - D_{\mbox K L} [q_{ϕ} (z ∣ x), p (z)]

L_{θ, ϕ} = x \in Z \sum - L_{θ, ϕ} (x)

L_{θ, ϕ} = x \in Z \sum - L_{θ, ϕ} (x)

p (z)

p (z)

q_{ϕ} (z ∣ x)

L_{\mbox a d v} = E_{x \sim p_{d}} [lo g D ((q_{Θ} (z ∣ x))] + E_{x \sim p_{z}} [lo g (1 - D (z))]

L_{\mbox a d v} = E_{x \sim p_{d}} [lo g D ((q_{Θ} (z ∣ x))] + E_{x \sim p_{z}} [lo g (1 - D (z))]

L_{θ, ϕ} = x \in Z \sum - E_{x \sim p_{d}} [lo g p_{θ} (x ∣ q_{ϕ} (z ∣ x))] - L_{\mbox a d v}

L_{θ, ϕ} = x \in Z \sum - E_{x \sim p_{d}} [lo g p_{θ} (x ∣ q_{ϕ} (z ∣ x))] - L_{\mbox a d v}

E_{z \sim q_{ϕ} (z ∣ x, y)} [lo g p_{θ} (x ∣ y, z)] - D_{\mbox K L} [q_{ϕ} (z ∣ x, y) ∣∣ p (z, y)]

E_{z \sim q_{ϕ} (z ∣ x, y)} [lo g p_{θ} (x ∣ y, z)] - D_{\mbox K L} [q_{ϕ} (z ∣ x, y) ∣∣ p (z, y)]

\hat{x} = x arg max p_{θ} (x ∣ y, z)

\hat{x} = x arg max p_{θ} (x ∣ y, z)

G min D max V (D, G) =

G min D max V (D, G) =

+ E_{z \in p_{z} (z)} [lo g (1 - D (G (z))]

D_{G}^{*} (x) = \frac{p _{d} ( x )}{p _{d} ( x ) + p _{θ_{G}} ( x )}

D_{G}^{*} (x) = \frac{p _{d} ( x )}{p _{d} ( x ) + p _{θ_{G}} ( x )}

C (G) = - lo g (4) + 2 D_{\mbox J S} (p_{d}, p_{θ_{G}})

C (G) = - lo g (4) + 2 D_{\mbox J S} (p_{d}, p_{θ_{G}})

D_{\mbox J S} (p_{d}, p_{θ_{G}}) =

D_{\mbox J S} (p_{d}, p_{θ_{G}}) =

\displaystyle+\frac{1}{2}D_{\mbox{{KL}}}\left(p_{\theta_{G}}\bigg{|}\bigg{|}\frac{p_{d}+p_{\theta_{G}}}{2}\right)

D_{\mbox K L} (p_{d}, p_{θ_{G}}) = \int p_{d} (z) (lo g p_{d} (z) - lo g p_{θ_{G}} (z)) d z

D_{\mbox K L} (p_{d}, p_{θ_{G}}) = \int p_{d} (z) (lo g p_{d} (z) - lo g p_{θ_{G}} (z)) d z

W (p, q) = γ \in Π (p, q) in f E_{(x, y) \in γ} ∣∣ x - y ∣∣

W (p, q) = γ \in Π (p, q) in f E_{(x, y) \in γ} ∣∣ x - y ∣∣

R (S) = λ D (S) + (1 - λ)) R (S)

R (S) = λ D (S) + (1 - λ)) R (S)

r_{\mbox d i v er s i t y} = 1 - \frac{1}{∣ G ∣} (x_{1}, x_{2}) \in G \times G \sum D (x_{1}, x_{2})

r_{\mbox d i v er s i t y} = 1 - \frac{1}{∣ G ∣} (x_{1}, x_{2}) \in G \times G \sum D (x_{1}, x_{2})

r_{\mbox S S N} = \frac{1}{∣ G ∣} x_{G} \in G \sum x_{D} \in D max D (x_{G}, x_{G})

r_{\mbox S S N} = \frac{1}{∣ G ∣} x_{G} \in G \sum x_{D} \in D max D (x_{G}, x_{G})

R_{\mbox d i v er s i t y mi s ma t c h} = - r_{\mbox d i v er s i t y}^{\mbox g e n er a t e d} - r_{\mbox d i v er s i t y}^{\mbox t r ainin g}

R_{\mbox d i v er s i t y mi s ma t c h} = - r_{\mbox d i v er s i t y}^{\mbox g e n er a t e d} - r_{\mbox d i v er s i t y}^{\mbox t r ainin g}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

\makeFNbottom

[TABLE]

††footnotetext: a Department of Mechanical Engineering, University of Maryland, College Park, Maryland, 20740, United States of America. E-mail:[email protected]††footnotetext: b Department of Mathematics and Statistics, American University, Washington, D.C., 20016, United States of America.††footnotetext: † Present address: National Institutes of Health Clinical Center, Bethesda, Maryland, United States of America.

The average cost to bring a new drug to market is now well over one billion USD,1 with an average time from discovery to market of 13 years.2 Outside of pharmaceuticals the average time from discovery to commercial production can be even longer, for instance for energetic molecules it is 25 years.3 A critical first step in molecular discovery is generating a pool of candidates for computational study or synthesis and characterization. This is a daunting task because the space of possible molecules is enormous—the number of potential drug-like compounds has been estimated to be between $10^{23}$ and $10^{60}$ ,4 while the number of all compounds that have been synthesized is on the order of $10^{8}$ . Heuristics, such as Lipinski’s “rule of five” for pharmaceuticals5 can help narrow the space of possibilities, but the task remains daunting. High throughput screening (HTS)6 and high throughput virtual screening (HTVS)7 techniques have made larger parts of chemical space accessible to computational and experimental study. Machine learning has been shown to be capable of yielding rapid and accurate property predictions for many properties of interest and is being integrated into screening pipelines, since it is orders of magnitude faster than traditional computational chemistry methods.8 Techniques for the interpretation and “inversion” of a machine learning model can illuminate structure-property relations that have been learned by the model which can in turn be used to guide the design of new lead molecules.9, 10 However even with these new techniques bad leads still waste limited supercomputer and laboratory resources, so minimizing the number of bad leads generated at the start of the pipeline remains a key priority. The focus of this review is on the use of deep learning techniques for the targeted generation of molecules and guided exploration of chemical space. We note that machine learning (and more broadly artificial intelligence) is having an impact on accelerating other parts of the chemical discovery pipeline as well, via machine learning accelerated ab-initio simulation,8 machine learning based reaction prediction,11, 12 deep learning based synthesis planning,13 and the development of high-throughput “self-driving” robotic laboratories.14, 15

Deep neural networks, which are often defined as networks with more than three layers, have been around for many decades but until recently were difficult to train and fell behind other techniques for classification and regression. By most accounts, the deep learning revolution in machine learning began in 2012, when deep neural network based models began to win several different competitions for the first time. First came a demonstration by Cireşan et al. of how deep neural networks could achieve near-human performance on the task of handwritten digit classification.16 Next came groundbreaking work by Krizhevsky et al. which showed how deep convolutional networks achieved superior performance on the 2010 ImageNet image classification challenge.17 Finally, around the same time in 2012, a multitask neural network developed by Dahl et al. won the “Merck Molecular Activity Challenge" to predict the molecular activities of molecules at 15 different sites in the body, beating out more traditional machine learning approaches such as boosted decision trees.18 One of the key technical advances published that year and used by both Krizhevsky et al. and Dahl et al. was a novel regularization trick called “dropout”.19 Another important technical advance was the efficient implementation of neural network training on graphics processing units (GPUs). By 2015 better hardware, deeper networks, and a variety of further technical advances had reduced error rates on the ImageNet challenge by a factor of 3 compared to the Krizhevsky’s 2012 result.20

In addition to the tasks of classification and regression, deep neural networks began to be used for generation of images, audio, and text, giving birth to the field of “deep generative modeling”. Two key technical advances in deep generative modeling were the variational autoencoder (Kimga et al., 201321) and generative adversarial networks (Goodfellow et al. 201422). The first work demonstrating deep generative modeling of molecules was the “molecular autoencoder” work of Gómez-Bombarelli et al. which appeared on the arXiv in October 2016 and was published in ACS Central Science in 2018.23 Since then, there has been an explosion of advancements in deep generative modeling of molecules using several different deep learning architectures and many variations thereof, as shown in table 2. In addition to new architectures, new representation schemes, many of which are graph based, have been introduced as alternatives to the SMILES representation used by Gómez-Bombarelli et al. The growing complexity of the landscape of architectures and representations and the lack of agreement upon standards for benchmarking and comparing different approaches has prompted us to write this review.

While much of the work so far has focused on deep generative modeling for drug molecules,24 there are many other application domains which are benefiting from the application of deep learning to lead generation and screening, such as organic light emitting diodes,25 organic solar cells,26 energetic materials,27, 10 electrochromic devices,28 polymers,29 polypeptides,30, 31, 32 and metal organic frameworks.33, 34

Our review touches on four major issues we have observed in the field. The first is the importance and opportunities for improvement by using different molecular representations. Recent efforts have begun to depart from the use of Simplified Molecular-Input Line-Entry System (SMILES) strings towards representations that are “closer to the chemical structure” and offer improved chemical accuracy, such as graph grammar based methods. The second issue is architecture selection. We discuss the pros and cons underlying different choices of model architecture and present some of their key mathematical details to better illuminate how different approaches relate to each other. This leads us to highlight the advantages of adversarial training and reinforcement learning over maximum likelihood based training. We also touch on techniques for molecular optimization using generative models, which has grown in popularity recently. The third major issue is the approaches for quantitatively evaluating different approaches for molecular generation and optimization. Fourth, and finally, we discuss is reward function design, which is crucial for the practical application of methods which use reinforcement learning. We contribute by offering novel overview of how to engineer reward function to generate a set of leads which is chemically stable, diverse, novel, has good properties, and is synthesizable.

There are reasons to be skeptical about whether today’s deep generative models can outperform traditional computational approaches to lead generation and optimization. Traditional approaches are fundamentally combinatorial in nature and involve mixing scaffolds, functional groups, and fragments known to be relevant to the problem at hand (for a review, see Pirard et al.35). A naive combinatorial approach to molecular generation leads to most molecules being unstable or impossible to synthesize, so details about chemical bonding generally must be incorporated. One approach is to have an algorithm perform virtual chemical reactions, either from a list of known reactions, or using ab initio methods for reaction prediction.36 Another popular approach is to use genetic algorithms with custom transformation rules which are known to maintain chemical stability.37 One of the latest genetic algorithm based approaches (“Grammatical Evolution”) can match the performance of the deep learning approaches for molecular optimization under some metrics.38 Deep generative modeling of molecules has made rapid progress in just a few years and there are reasons to expect this progress to continue, not just with better hardware and data, but due to new architectures and approaches. For instance, generative adversarial networks and deep reinforcement learning (which may be combined or used separately) have both seen technical advancements recently.

1 Molecular representation
1.1 Representations of 3D geometry
1.2 Representations of molecular graphs
1.2.1 SMILES and string-based representations
1.2.2 Image-based representations
1.2.3 Tensor representations
1.2.4 Other graph-based representations
2 Deep learning architectures
2.1 Recurrent neural networks (RNNs)
2.1.1 Optimization with RNNs using reinforcement learning
2.2 Autoencoders
2.2.1 Variational autoencoders (VAEs)
2.2.2 Adversarial autoencoders (AAEs)
2.2.3 Supervised VAEs/AAEs for property prediction & optimization
2.3 Generative adversarial networks (GANs)
2.3.1 The perfect discriminator problem and training instabilities
3 Metrics and reward functions
3.1 Metrics for scoring generative models
3.2 Reward function design
3.2.1 Diversity & novelty
3.2.2 Stability and synthesizability
3.2.3 Rewards for good properties
4 Prospective and future directions

1 Molecular representation

The molecular representation refers to the digital encoding used for each molecule that serves as input for training the deep learning model. A representation scheme must capture essential structural information about each molecule. Creating an appropriate representation from a molecular structure is called featurization. Two important properties that are desirable (but not required) for representations are uniqueness and invertibility. Uniqueness means that each molecular structure is associated with a single representation. Invertibility means that each representation is associated with a single molecule (a one-to-one mapping). Most representations used for molecular generation are invertible, but many are non-unique. There are many reasons for non-uniqueness, including the representation not being invariant to the underlying physical symmetries of rotation, translation, and permutation of atomic indexes.

Another factor one should consider when choosing a representation is the whether it is a character sequence or tensor. Some methods only work with sequences, while others only work with tensor. Sequences may be converted into tensors using one-hot encoding. Another choice is whether to use a representation based on the 3D coordinates of the molecule or a representation based on the 2D connectivity graph. Molecules are fundamentally three dimensional quantum mechanical objects, typically visualized as consisting of nuclei with well-defined positions surrounded by many electrons which are described by a complex-valued wavefunction. Fundamentally, all properties of a molecule can be predicted using quantum mechanics given only the relative coordinates of the nuclei and the type and ionization state of each atom. However, for many applications, working with 3D representations is cumbersome and unnecessary. In this section, we review both 3D and 2D representation schemes that have been developed recently.

1.1 Representations of 3D geometry

Trying to implement machine learning directly with nuclear coordinates introduces a number of issues. The main issue is that coordinates are not invariant to molecular translation, rotation, and permutation of atomic indexing. While machine learning directly on coordinates is possible, it is much better to remove invariances to create a more compact representation (by removing degrees of freedom) and develop a scheme to obtain a unique representation for each molecule. One approach that uses 3D coordinates uses a 3D grid of voxels and specifies the nuclear charge contained within each voxel, thus creating a consistent representation. Nuclear charge (i.e. atom type) is typically specified by a one-hot vector of dimension equal to the number of atom types in the dataset. This scheme leads to a very high dimensional sparse representation, since the vast majority of voxels will not contain a nuclear charge. While sparse representations are considered desirable in some contexts, here sparsity leads to very large training datasets. This issue can be mitigated via spatial smoothing (blurring) by placing spherical Gaussians or a set of decaying concentric waves around each atomic nuclei.39 Alternatively, the van der Waals radius may be used.40 Amidi et al. use this type of approach for predictive modeling with 3D convolutional neural networks (CNNs),41 while Kuzminykh et al. and Skalic et al. use this approach with a CNN-based autoencoder for generative modeling.39, 40

Besides high dimensionality and sparsity, another issue with 3D voxelized representations is they do not capture invariances to translation, rotation, and reflection, which are hard for present-day deep learning based architectures to learn. Capturing such invariances is important for property prediction, since properties are invariant to such transformations. It is also important for creating compact representations of molecules for generative modeling. One way to deal with such issues is to always align the molecular structure along a principal axis as determined by principle component analysis to ensure a unique orientation.41, 39 Approaches which generate feature vectors from 3D coordinates that are invariant to translation and rotation are wavelet transform invariants,42 solid harmonic wavelet scattering transforms,43 and tensor field networks.44 All of these methods incur a loss of information about 3D structure and are not easy to invert, so their utility for generative modeling may be limited (deep learning models learn to generate these representations, but if they cannot be unambiguously related to a 3D structure they are not very useful). Despite their issues with invertibility, tensor field networks have been suggested to have utility for generative modeling since it was shown they can accurately predict the location of missing atoms in molecules where one atom was removed.44 We expect future work on 3D may proceed in the direction of developing invertible representations that are based on the internal (relative) coordinates of the molecule.

1.2 Representations of molecular graphs

1.2.1 SMILES and string-based representations

Most generative modeling so far has not been done with coordinates but instead has worked with molecular graphs. A molecule can be considered as an undirected graph $\mathcal{G}$ with a set of edges $\mathcal{E}$ and set of vertices $\mathcal{V}$ . The obvious disadvantage of such graphs is that information about bond lengths and 3D conformation is lost. For some properties one may wish to predict, the specific details of a molecule’s 3D conformations may be important. For instance, when packing in a crystal or binding to a receptor, molecules will find the most energetically favorable conformation, and details of geometry often have a big effect. Despite this, graph representations have been remarkably successful for a variety of generative modeling and property prediction tasks. If a 3D structure is desired from a graph representation, molecular graphs can be embedded in 3D using distance geometry methods (for instance as implemented in the OpenBabel toolkit45, 46). After coordinate embedding, the most energetically favorable conformation of the molecule can be obtained by doing energy minimization with classical forcefields or quantum mechanical simulation.

There are several ways to represent graphs for machine learning. The most popular way is the SMILES string representation.47 SMILES strings are a non-unique representation which encode the molecular graph into a sequence of ASCII characters using a depth-first graph traversal. SMILES are typically first converted into a one-hot based representation. Generative models then produce a categorical distribution for each element, often with a softmax function, which is sampled. Since standard multinomial sampling procedures are non-differentiable, sampling can be avoided during training or a Gumbel-softmax can be used.48, 49

Many deep generative modeling techniques have been developed specifically for sequence generation, most notably Recurrent Neural Networks (RNNs), which can be used for SMILES generation. The non-uniqueness of SMILES arises from a fundamental ambiguity about which atom to start the SMILES string construction on, which means that every molecule with $N$ heavy (non-hydrogen) atoms can have at least $N$ equivalent SMILES string representations. There is additional non-uniqueness due to different conventions on whether to include charge information in resonance structures such as nitro groups and azides. The MolVS50 or RDKit51 cheminformatics packages can be used to standardize SMILES, putting them in a canonical form. However, Bjerrum et al. have pointed out that the latent representations obtained from canonical SMILES may be less useful because they become more related to specific grammar rules of canonical SMILES rather than the chemical structure of the underlying molecule.52 This is considered an issue for interpretation and optimization since it is better if latent spaces encode underlying chemical properties and capture notions of chemical similarity rather than SMILES syntax rules. Bjerrum et al. have suggested SMILES enumeration (training on all SMILES representations of each molecule), rather than using canonical SMILES, as a better solution to the non-uniqueness issue.52, 53 An approach similar to SMILES enumeration is used in computer vision applications to obtain rotational invariance—image datasets are often “augmented” by including many rotated versions of each image. Another approach to obtain better latent representations explored by Bjerrum et al. is to input both enumerated SMILES and Chemception-like image arrays (discussed below) into a single “heteroencoder” framework.52

In addition to SMILES strings, Gómez-Bombarelli et al. have tried InChI strings54 with their variational autoencoder, but found they led to inferior performance in terms of the decoding rate and the subjective appearance of the molecules generated. Interestingly, Winter et al. show that more physically meaningful latent spaces can be obtained by training a variational autoencoder to translate between InChI to SMILES.55 There is an intuitive explanation for this—the model must learn to extract the underlying chemical structures which are encoded in different ways by the two representations.

SMILES based methods often struggle to achieve a high percentage of valid SMILES. As a possible solution to this, Kusner et al. proposed decomposing SMILES into a sequence of rules from a context free grammar (CFG).56 The rules of the context-free grammar impose constraints based on the grammar of SMILES strings.57 Because the construction of SMILES remains probabilistic, the rate of valid SMILES generation remains below 100%, even when CFGs are employed and additional semantic constraints are added on top.57 Despite the issues inherent with using SMILES, we expect it will continue to a popular representation since most datasets store molecular graphs using SMILES as the native format, and since architectures developed for sequence generation (i.e. for natural language or music) can be readily adopted. Looking longer term, we expect a shift towards methods which work directly with the graph and construct molecules according to elementary operations which maintain chemical valence rules.

Li et al. have developed a conditional graph generation procedure which obtains a very high rate of valid chemical graphs (91%) but lower negative log likelihood scores compared to a traditional SMILES based RNN model.58 Another more recent work by Li et al. uses a deep neural network to decide on graph generation steps (append, connect, or terminate).59 Efficient algorithms for graph and tree enumeration have been previously developed in a more pure computer science context. Recent work has looked at how such techniques can be used for molecular graph generation,60 and likely will have utility for deep generative models as well.

1.2.2 Image-based representations

Most small molecules are easily represented as 2D images (with some notable exceptions like cubane). Inspired by the success of Google’s Inception-ResNet deep convolutional neural network (CNN) architecture for image recognition, Goh et al. developed “Chemception”, a deep CNN which predicts molecular properties using custom-generated images of the molecular graph.61 The Chemception framework takes a SMILES string in and produces an 80x80 greyscale image which is actually an array of integers, where empty space is ‘0’, bonds are ‘2’ and atoms are represented by their atomic number.61 Bjerrum et al. extend this idea, producing “images” with five color channels which encode a variety of molecular features, some which have been compressed to few dimensions using PCA.52

1.2.3 Tensor representations

Another approach to storing the molecular graph is to store the vertex type (atom type), edge type (bond type), and connectivity information in multidimensional arrays (tensors). In the approach used by de Cao & Kipf,49, 62 each atom is a vertex $v_{i}\in\mathcal{V}$ which may be represented by a one-hot vector $\boldsymbol{x}_{i}\in\{0,1\}^{|\mathcal{A}|}$ which indicates the atom type, out of $|\mathcal{A}|$ possible atom types. Each bond is represented by an edge $(v_{i},v_{j})$ which is associated with a one-hot vector $\boldsymbol{y}_{i}\in\{0,1\}^{Y}$ representing the type of bond out of $Y$ possible bond types. The vertex and edge information can be stored in a vertex feature matrix $X=[\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{N}]^{T}\in\mathbb{R}^{N\times|\mathcal{A}|}$ and an adjacency tensor $A\in\mathbb{R}^{N\times N\times Y}$ where $A_{ij}\in\mathbb{R}^{Y}$ . Simonovsky et al.63 use a similar approach—they take a vertex feature matrix $X$ and concatenate the adjacency tensor $A$ with a traditional adjacency matrix where connections are indicated by a ‘1’. As with SMILES, adjacency matrices suffer from non-uniqueness—for a molecule with $N$ atoms, there are $N!$ equivalent adjacency matrices representing the same molecular graph, each corresponding to a different re-ordering of the atoms/nodes. This makes it challenging to compute objective functions, which require checking if two adjacency matrix representations correspond to the same underlying graph (the “graph isomorphism ” problem, which takes $N^{4}$ operations in the worse case). Simonovsky et al. use an approximate graph matching algorithm to do this, but it is still computationally expensive.

1.2.4 Other graph-based representations

Another approach is to train an RNN or reinforcement learning agent to operate directly on the molecular graph, adding new atoms and bonds in each action step from a list of predefined possible actions. This approach is taken with the graph convolutional policy network64 and in recent work using pure deep reinforcement learning to generate molecules.65 Because these methods work directly on molecular graphs with rules which ensure that basic atom valence is satisfied, they generate 100% chemically valid molecules.

Finally, when limited to small datasets one may elect to do generative modeling with compact feature vectors based on fingerprinting methods or descriptors. There are many choices (Coulomb matrices, bag of bonds, sum over bonds, descriptor sets, graph convolutional fingerprints, etc.) which we have previously tested for regression,27, 66 but they are generally not invertible unless a very large database with a look-up table has been constructed. (In this context, by invertible we mean the complete molecular graph can be reconstructed without loss.) As an example of how it may be done, Kadurin et al. use 166 bit Molecular ACCess System (MACCS) keys67 for molecular representation with adversarial autoencoders.68, 69 In MACCS keys, also called MACCS fingerprints, each bit is associated with a specific structural pattern or question about structure. To associate molecules to MACCS keys one must search for molecules with similar or identical MACCS keys in a large chemical database. Fortunately several large online chemical databases have application programming interfaces (APIs) which allow for MACCS-based queries, for instance PubChem, which contains 72 million compounds.

2 Deep learning architectures

In this section we summarize the mathematical foundations of several popular deep learning architectures and expose some of their pros and cons. A basic familiarity with machine learning concepts is assumed.

2.1 Recurrent neural networks (RNNs)

We discuss recurrent neural network sequence models first because they are fundamental to molecular generation—most VAE and GAN implementations include an RNN for sequence generation. In what follows, a sequence of length $T$ will be denoted as $S_{1:T}=(s_{1},\cdots,s_{T}),s_{t}\in\mathcal{V}$ , where $\mathcal{V}$ is the set of tokens, also called the vocabulary. For the purpose of this section we assume the sequences in question are SMILES strings, as they are by far the most widely used. As discussed previously in the context of SMILES the “tokens” are the different characters which are used to specify atom types, bond types, parentheses, and the start and stop points of rings. The first step in sequence modeling is typically one-hot encoding of the sequence’s tokens, in which each token is represented as a unique $N$ dimensional vector where one element is $1$ and the rest are [math] (where $N$ is the number of tokens in the vocabulary).

Recurrent neural networks (RNNs) are the most popular models for sequence modeling and generation. We will not go into detail of their architecture, since it is well described elsewhere.119, 120 An important detail to note however is that the type of RNN unit that is typically used for generating molecules is either the long short term memory (LSTM) unit,121 or a newer more computationally efficient variant called the gated recurrent unit (GRU).122 Both LSTMs and GRUs contain a memory cell which alleviates the exploding and vanishing gradient problems that can occur when training RNNs to predict long-term dependencies.121, 122

Sequence models are often trained to predict just a single missing token in a sequence, as trying to predict more than one token leads to a combinatorial explosion of possibilities. Any machine learning model trained to predict the next character in an input sequence can be run in “generative mode" or “autoregressive mode" by concatenating the predicted token to the input sequence and feeding the new sequence back into the model. However, this type of autoregressive generation scheme typically fails because the model was trained to predict on the data distribution and not its own generative distribution, and therefore each prediction contains at least a small error. As the network is run recursively, these errors rapidly compound, leading to rapid degradation in the quality of the generated sequences. This problem is known as “exposure bias”.123 The Data as Demonstrator (DaD) algorithm tries to solve the problem of exposure bias by running a model recursively during training and comparing the output to the training data during training.124 DaD was extended to sequence generation with RNNs by Bengio et al., who called the method “scheduled sampling".125 While research continues in this direction, issues have been raised about the lack of a firm mathematical foundation for such techniques, with some suggesting they do not properly approximate maximum likelihood.126

Better generative models can be obtained by training using maximum likelihood maximization on the sequence space rather than next-character prediction. In maximum likelihood training a model $\pi_{\theta}$ parametrized by $\theta$ is trained with the following differentiable loss:

[TABLE]

Here $\mathcal{Z}$ is the set of training sequences which are assumed to each be of length $T$ . This expression is proportional to the negative cross entropy of the model distribution and the training data distribution (maximizing likelihood is equivalent to minimizing cross entropy). MLE training can be done with standard gradient descent techniques and backpropagation through time to compute the gradient of the loss. In practice though this type of training fails to generate valid SMILES, likely because of strict long term dependencies such as closing parentheses and rings. The “teacher forcing” training procedure127 is an important ingredient which was found to be necessary to include in the molecular autoencoder VAE to capture such long term dependencies—otherwise the generation rate of valid SMILES was found to be near 0%.128 In teacher forcing, instead of sampling from the model’s character distribution to get the next character, the right character is given directly to the model during training.120

In the context of SMILES strings generation, to generate each character the output layer usually gives probabilities $p_{i}$ for every possible SMILES string token. When running in generative mode, the typical method is to use a multinomial sampler to sample this distribution, while in training mode one typically just chooses the token with the highest probability. Using a multinomial sampler captures the model’s true distribution, but because MLE training tends to focus on optimizing the peaks of the distribution and doesn’t always capture the tails of distributions well. So called “thermal” rescaling can be used to sample further away from the peaks of the distribution by rescaling the probabilities as:

[TABLE]

where $T$ is a sampling temperature. Alternatively, if a softmax layer is used to generate the final output of a neural network, a temperature parameter can be built directly into it. Another alternative is the “freezing function”:

[TABLE]

Sampling at low $T$ leads to the generation of molecules which are only slight variations on molecules seen in the training data. Generation at high $T$ leads to greater diversity but also higher probability of nonsensical results.

2.1.1 Optimization with RNNs using reinforcement learning

Neil et al introduced a simple method for repeated MLE which biases generation towards molecules with good properties, which they call HillClimb-MLE.77 Starting with a model that has been trained via MLE on the training data, they generate a large set of SMILES sequences. They then calculate a reward function $R(S)$ for each sequence $S$ generated and find the subset of $N^{\prime}$ generated molecules with the highest rewards. This subset is used to retrain the model with MLE, and the process is repeated. Each time a new subset of $N^{\prime}$ generated molecules is determined, it is concatenated on the previous set, so the amount of data being used for MLE grows with each iteration. As this process is repeated the model begins to generate molecules which return higher and higher values from $R(S)$ .

A more common technique is to use reinforcement learning after MLE pretraining to fine tune the generator to produce molecules with high reward. The problem of sequence generation can be recast as a reinforcement learning problem with a discrete action space. At each timestep time $t$ , the current state of the “environment” is the sequence generated so far is $(s_{0},\cdots,s_{t})$ and the action $a$ is next token to be chosen, $a=s_{t+1}$ . The goal of reinforcement learning is to maximize the expected return $G_{T}$ for all possible start states $s_{0}$ . The return function $G_{t}=\sum_{i=t}^{T}R_{i}$ simply sums the rewards over the length of time the agent is active, which is called an episode. Mathematically the optimization problem reinforcement learning tries to solve is expressed as:

[TABLE]

where $\theta$ are the parameters of the model. In our case, one episode corresponds to the generation of one molecule, there is only one start state, (the ‘GO’ character) and $R_{t}=0$ until the end-of-sequence (‘EOS’) token is generated or the max string length is reached. If $T$ denotes the max length of the SMILES string then only $R_{T}$ is non-zero and therefore $G_{t}=R_{T}$ for all $t$ . The state transition is deterministic (i.e. $p^{a}_{s,s^{\prime}}=1$ for the next state $S_{1:t+1}$ if the current state is $S_{1:t}$ and the action $a=s_{t+1}$ , while for other states $s^{\prime\prime}$ , $p^{a}_{s,s^{\prime\prime}}=0$ ). Because of these simplifications, eqn. 4 assumes a simple form:

[TABLE]

Here the policy model $\pi_{\theta}(a|s)$ gives the probability for choosing the next action given the current state. In our case:

[TABLE]

There are many reinforcement learning methods, but broadly speaking they can be broken into value learning and policy learning methods. Most work so far has used variants of the REINFORCE algorithm,129 a type of policy learning method which falls into the class of policy gradient methods. It can be shown that for a run of length $T$ the gradient of $J(\theta)$ (eqn. 4) is:

[TABLE]

Computing the exact expectation of $G_{t}$ for all possible action sequences is impossible, so instead the $G_{t}$ from a single run is used before each gradient update. This is sometimes referred to as a “Monte-Carlo” type approach. Fortunately, this process can be parallelized by calculating multiple gradient updates on different threads before applying them. Neil et al. recently tested several newer reinforcement learning algorithms—Advantage Actor-Critic (AAC) and Proximal Policy Optimization (PPO), where they report superior performance over REINFORCE (PPO $>$ AAC $>$ REINFORCE). Interestingly, they find Hillclimb-MLE is competitive with and occasionally superior to PPO.77

Olivecrona et al. argue that policy learning methods are a more natural choice for molecular optimization because they can start with a pre-trained generative model, while value-function learning based methods cannot.73 Additionally, most policy learning methods have been proven to lead to an optimal policy and the resulting generative models are faster to sample.73 In contrast, Zhou et al. argue that value function learning methods are superior in part because policy gradient methods suffer from issues with high variance in the gradient estimates during training.65

Empirically it has been found that using RL after MLE can cause the generated model to “drift” too far, causing important information about viable chemical structures learned during MLE to be lost. This can take the form of highly unstable structures being generated or invalid SMILES. One solution is to “augment” the reward function with the likelihood:73, 130

[TABLE]

Other possibilities are explored by Olivecrona et al.73 Fundamentally, whether “drift” during RL training becomes an issue depends on the details of the reward function—if the reward function is good, drift should be in a good direction. Recently Zhou et al. sought approaches that circumvent MLE when training. In their RL based approach for molecular optimization, they do not use an RNN or any pre-trained generative model and instead use pure RL training.65 The RL agent works directly on constructing molecular graphs, taking actions such as atom/bond addition and atom removal. The particular approach they use is deep- $Q$ learning, which incorporates several recent innovations that were developed at DeepMind and elsewhere.131 Jaques et al. have also explored the application of deep $Q$ -learning and $G$ -Learning to molecular optimization.130 Reinforcement learning is a rapidly developing field, and there remain many recent advancements such as new attention mechanisms which have not yet been tested in the domain of molecular optimization.

To give a flavor of what applications have been demonstrated, we will breifly present some representative works using RNNs. Bjerrum & Threfal explore using an architecture consisting of 256 LSTM cells followed by a “time-distributed dense” output layer.71 Their network achieved a SMILES validity rates of $98$ % and the property distributions for the properties tested matched the property distributions found in the training data (some key properties they looked at were synthetic accessibility score, molecular weight, LogP, and total polar surface area). Popova et al. have shown how an RNN trained for generation can be further trained with reinforcement learning to generate molecules targeted towards a specific biological function - in their case they focused on the decree to which molecules bind and inhibit the JAK2 enzyme, for which much empirical data exists. They showed how their system could be used to either maximize or minimize inhibition with JAK2 and also independently discovered 793 commercially available compounds found in the ZINC database.78 In a similar vein, Segler et al. fine tune an RNN to generate a “focused library” of molecules which are likely to target the 5-HT ${}_{\mbox{{2A}}}$ receptor.74 Finally, Olivecrona et al. show how a generative model can be fine tuned to generate analogs of a particular drug (Celecoxib) or molecules which bind to the type 2 dopamine receptor.73

2.2 Autoencoders

In 2006 Hinton and Salakhutdinov showed how advances in computing power allowed for the training of a deep autoencoder which was capable of beating other methods for document classification.132 The particular type of neural network they used was a stack of restricted Boltzmann machines, an architecture which would later be called a “deep Boltzmann machine".133 While deep Boltzmann machines are theoretically powerful, they are computationally expensive to train and impractical for many tasks. In 2013 Kingma et al. introduced the variational autoencoder (VAE), 21 which was used in 2016 by Bombarelli et al. to create the first machine learning based generative model for molecules.23

2.2.1 Variational autoencoders (VAEs)

VAEs are derived mathematically from the theory of variational inference and are only called autoencoders because the resulting architecture has the same high level structure as a classical autoencoder. VAEs are fundamentally a latent variable model $p(\boldsymbol{x},\boldsymbol{z})=p_{\theta}(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})$ which consists of latent variables $\boldsymbol{z}$ drawn from a pre-specified prior $p(\boldsymbol{z})$ and passed into a decoder $p_{\theta}(\boldsymbol{x}|\boldsymbol{z})$ parametrized by parameters $\theta$ . To apply maximum likelihood learning to such a model we like to maximize the probability of each observed datapoint $p(\boldsymbol{x})=\int p_{\theta}(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})d\boldsymbol{z}$ for all datapoints in our training data. However for complicated models with many parameters $\theta$ (like neural networks) this integral is intractable to compute. The method of variational inference instead maximizes a lower bound on $\log p(\boldsymbol{x})$ :

[TABLE]

where $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ is an estimate of posterior distribution $p(\boldsymbol{z}|\boldsymbol{x})=p_{\theta}(\boldsymbol{x}|\boldsymbol{z})p(\boldsymbol{z})/p(\boldsymbol{x})$ . The right hand side of eqn. 9 is called the “negative variational free energy” or “evidence lower bound” (ELBO) and can be written as:

[TABLE]

Here we encounter the Kullback-Leibler (KL) divergence:

[TABLE]

After several manipulations, eqn. 10 can be written as

[TABLE]

where $H$ is the (differentiable) entropy. The loss function for the variational autoencoder for examples $\boldsymbol{x}$ in our training dataset $\mathcal{Z}$ is:

[TABLE]

In a VAE, during training first $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ (the encoder) generates a $\boldsymbol{z}$ . Then the decoder $p_{\theta}(\boldsymbol{x}|\boldsymbol{z})$ model attempts to recover $\boldsymbol{x}$ . Training is done using backpropagation and the loss function (eqn. 13) which tries to maximize $\mathcal{L}(\boldsymbol{x},\theta,\phi)$ . This corresponds to maximizing the chance of reconstruction $p_{\theta}(\boldsymbol{x}|\boldsymbol{z})$ (the first term) but also minimizing the KL-divergence between $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ and the prior distribution $p(\boldsymbol{z})$ . Typically the prior is chosen to be a set of independent unit normal distributions and the encoder is assumed to be a factorized multidimensional normal distribution:

[TABLE]

The encoder $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ is typically a neural network with parameters $\phi$ used to find the mean and variance functions in $\mathcal{N}(\boldsymbol{z},\boldsymbol{\mu}(\boldsymbol{x}),\mbox{diag}(\boldsymbol{\sigma}^{2}(\boldsymbol{x}))$ . The resulting “Gaussian VAE” has the advantage that the KL-divergence can be computed analytically. The parameters $\theta$ and $\phi$ in the decoder and encoder are all learned via backpropagation. There are several important innovations which have been developed to streamline backpropagation and training which are described in detail elsewhere.21, 120, 134

There are several reasons that variational autoencoders perform better than classical autoencoders. Since the latent distribution is probabilistic, this introduces noise which intuitively can be seen as a type of regularization that forces the VAE to learn more robust representations. Additionally, specifying that the latent space must be a Gaussian leads to a much smoother latent space which makes optimization much easier and also leads to fewer “holes” in the distribution corresponding to invalid or bad molecules. VAEs therefore are useful for interpolation between points corresponding to molecules in the training data.

In the molecular autoencoder of Gómez-Bombarelli et al. each SMILES $\boldsymbol{x}$ is converted to a one-hot representation and a convolutional neural network is used to find the parameters of the Gaussian distribution $q_{\phi}(\boldsymbol{z}|\boldsymbol{x})$ .23 The decoder in the molecular autoencoder is an RNN, but in contrast to pure RNN models, where high rates of valid SMILES generation have been reported (94-98 %),77, 74, 71 the molecular autoencoder generates far fewer valid SMILES. The valid SMILES rate was found to vary greatly between $\approx$ 75% for points near known molecules to only 4% for randomly selected latent points.23 Kusner et al. report an average valid decoding rate of only 0.7% using a similar VAE architecture.56 These low decoding rates are not a fatal issue however simply because a validity checker (such as found in RDKit) can easily be used to throw out invalid SMILES during generation. However, the low rate of validity suggests fundamental issues in quality of the learned latent representation. As mentioned previously, higher rates of SMILES validity have been achieved by representing SMILES in terms of rules from a context-free grammar (CFG).56, 57 Kusner et al. achieved somewhat higher rates of SMILES generation using a CFG (7.2%, as described in the supplementary information of Kusner et al.56). Further work by Dai et al. added additional “semantic” constraints on top of a CFG yielding a higher rate of valid SMILES (43.5%).57 Janz et al. recently proposed using Bayesian active learning as a method of forcing models to learn what makes a sequence valid, and incorporating this into RNNs in VAEs could lead to higher valid decoding rates.135

2.2.2 Adversarial autoencoders (AAEs)

Adversarial autoencoders are similar to variational autoencoders, but differ in the means by which they regularize the latent distribution by forcing it to conform to the prior $p(\boldsymbol{z})$ .136 Instead of minimizing KL-divergence metric to enforce the generator to output a latent distribution corresponding to a prespecified prior (usually a normal distribution), they use adversarial training with a discriminator $D$ whose job is to distinguish the generator’s latent distribution from the prior. The discriminator outputs a probability $p\in(0,1)$ which predicts the probability samples it sees are from the prior. The objective of the discriminator is maximize the following:

[TABLE]

The overall objective function for the AAE to minimize can be expressed as

[TABLE]

2.2.3 Supervised VAEs/AAEs for property prediction & optimization

In supervised VAEs, target properties $\boldsymbol{y}$ for each molecule are incorporated into the generator in addition to the SMILES strings or other molecular representation. Figure 2 shows several different ways this can be done, representing the generative models as graphical models. Everything we discuss in this section can also be applied to AAEs,136 but we restrict our discussion to VAEs for simplicity.

In the work by Gómez-Bombarelli et al. they attached a neural network (multilayer perceptron) to the latent layer and jointly trained the neural network to predict property values $y$ and the VAE to minimize reconstruction loss. One unique property they optimize after training such a system is the predicted HOMO-LUMO gap, which is important for determining a molecule’s utility in organic solar cells. The advantage of supervised VAEs is that the generator learns a good latent representation both for property prediction and reconstruction. With the property predictor trained, it becomes possible to do property optimization in the latent space, by either using Gaussian process optimization or gradient ascent. Interestingly, in supervised VAEs a particular direction in the latent space always became correlated with the property value $y$ , while this was never observed in unsupervised VAEs. When one desires to do optimization, Gómez-Bombarelli et al. argue for using a Gaussian process model as the property predictor instead of a neural network, because it generates a smoother landscape.23 The specific procedure they used was to first train a VAE using supervised training with a neural network property predictor and then train a Gaussian process model separately using the latent space representations of the training molecules as input. They then use the Gaussian process model for optimization, and they showed it was superior to a genetic optimization algorithm and random search in the latent space. Since that work, several other researchers have used Gaussian process regression to perform optimization in the latent space of a VAE.109, 137, 98

Two other types of supervised VAEs are shown in figure 2, which we call “type 2” and “type 3”. Unlike the autoencoder frame work discussed in the previous section, these two types of autoencoder can be used for conditional generation. In “type 3” supervised VAEs the ELBO term in the objective function (eqn. 12) becomes:88

[TABLE]

Kang et al. assume that the property values have a Gaussian distribution. Type 3 VAEs are particularly useful when $\boldsymbol{y}$ is known for only some of the training data (a semi-supervised setting). In semi-supervised VAEs, the generator is tasked with predict $\boldsymbol{y}$ and is trained on the molecules where $\boldsymbol{y}$ is known and makes a best guess prediction for the rest. In effect, when $\boldsymbol{y}$ is not known, it becomes just another latent variable and a different objective function is used (for details, see Kang et al.88).

In Type 2 VAEs, property data $\boldsymbol{y}$ is embedded directly into the latent space during training.87, 110 Supervised and semi-supervised VAEs can both be used for conditional sampling, and thus are sometimes called “conditional VAEs”. In the traditional way of doing conditional sampling, $\boldsymbol{y}$ is specified and then one samples from the prior $p(\boldsymbol{z})$ . Then one samples from the generator $p_{\theta}(\boldsymbol{x}|\boldsymbol{y},\boldsymbol{z})$ . In the case of Type 1 and Type 2 VAEs, however, there is an issue pointed out by Polykovskiy et al. which they call “entanglement”.110 The issue is that when sampling we assumed that $p(\boldsymbol{z})$ is independent of $p(\boldsymbol{y})$ . However, the two distributions are actually “entangled” by the implicit relationship between $x$ and $\boldsymbol{y}$ which is in the training data (this is indicated by a dashed line in figure 2). For consistency, one should be sampling from $p(\boldsymbol{z}|\boldsymbol{y})$ . Polykovskiy et al. developed two “disentanglement” approaches to ameliorate this issue: learning $p(\boldsymbol{z}|\boldsymbol{y})$ and forcing all $p(\boldsymbol{z}|\boldsymbol{y})$ to match $p(\boldsymbol{z})$ .110

When generating molecules with an RNN, we previously discussed sampling from the model’s distribution by simply running the model and taking either the token with the maximum probability or using a multinomial sampler at each step of the sequence generation. When sampling from the generator of a conditional VAE, we wish to know what the model says is the likely molecule given $\boldsymbol{y}$ and $\boldsymbol{z}$ , since we are interested in focusing on the molecules the model predicts are most likely to be associated with a particular set of properties:

[TABLE]

Taking the most likely token at each step (the “greedy” method) is only a rough approximation to $\hat{\boldsymbol{x}}$ . Unfortunately, completing the optimization in eqn. 18 is a computationally intractable problem because the space of sequences grows exponentially with the length of the sequence. However, a variation on the greedy method called “beam search” can be used to get an approximation of $\hat{\boldsymbol{x}}$ .138, 88 In brief, beam search keeps the top $K$ most likely (sub)sequences at each step of the generation.

2.3 Generative adversarial networks (GANs)

The key idea underlying GANs is to introduce a discriminator network whose job is to distinguish whether the molecule it is looking at was generated by the generative model or came from the training data. In GAN training, the objective of the generative model becomes to try to fool the discriminator rather than maximizing likelihood. There are theoretical arguments and growing empirical evidence showing that GAN models can overcome some of the well known weaknesses of maximum likelihood based training. However, there are also many technical difficulties which plague GAN training and getting GANs to work well typically requires careful hyperparameter tuning and implementation of several non-obvious “tricks”.139 GANs are a rapidly evolving research area, and given space limitations we can only touch on several of the key developments here.

The original paper on GANs (Goodfellow et al 2014) introduced the following objective function:22

[TABLE]

Here $p_{d}(\boldsymbol{x})$ is the data distribution. This form of the objective function has a nice interpretation as a two person minimax game. However, this objective function is rarely used in practice for a few reasons. Firstly, as noted in the original paper, this objective function does not provide a very strong gradient signal when training starts because then $\log(1-D(G(\boldsymbol{z}))$ saturates (goes to negative infinity) and the numerical derivative becomes impossible to calculate. Still, understanding this objective function can help understand how generative modeling with GAN training can be superior to maximum likelihood based generative modeling. For a fixed $G$ , the optimal discriminator is:

[TABLE]

If we assume $D=D_{G}^{*}$ , then the objective function $\mathcal{C}(G)$ for the generator can be expressed as:22

[TABLE]

Where $D_{\mbox{{JS}}}(p_{d},p_{\theta_{G}})$ is the Jensen-Shannon divergence:

[TABLE]

Here $D_{\mbox{{KL}}}(p,q)$ is the Kullback-Leibler (KL) divergence. Maximizing the log-likelihood is equivalent to minimizing the forward KL divergence $D_{\mbox{{KL}}}(p_{d},p_{\theta_{G}})$ .134 To better understand what this means, we can rewrite the equation for KL divergence (eqn. 11) in a slightly different way:

[TABLE]

This shows us that KL divergence captures the difference between $p_{d}$ and $p_{\theta_{G}}$ weighted by $p_{d}$ . Thus one of the weaknesses of maximum likelihood is that $p_{\theta_{G}}$ may have significant deviations from $p_{d}$ when $p_{d}\approx 0$ . To summarize, the forward KL divergence ( $D_{\mbox{{KL}}}(p_{d},p_{\theta_{G}})$ ) punishes models that underestimate the data distribution, while the reverse KL divergence ( $D_{\mbox{{KL}}}(p_{\theta_{G}},p_{d})$ ) punishes models that overestimate the data distribution. Therefore we see that eqn. 21, which contains both forward and backwards KL terms, takes a more “balanced” approach than maximum likelihood, which only optimizes forward KL divergence. Optimizing reverse KL divergence directly requires knowing an explicit distribution for $p_{d}$ , which usually is not available. In effect, the discriminator component of the GAN works to learn $p_{d}$ , and thus GANs provide a way of overcoming this issue.

As noted before, the GAN objective function given in eqn. 19, however, does not provide a good gradient signal when training first starts since typically $p_{d}$ and $p_{\theta_{G}}$ have little overlap to begin with. Empirically, this occurs because data distributions typically lie on a low dimensional manifold in a high dimensional space, and the location of this manifold is not known in advance. The Wasserstein GAN (WGAN) is widely accepted to provide a better metric for measuring the distance between $p_{d}$ and $p_{\theta_{G}}$ than the original GAN objective function and results in faster and more stable training.140 The WGAN is based on the Wasserstein metric (also called the “Earth mover’s distance") which can be informally understood by imagining the two probability distributions $p_{d}$ and $p_{\theta_{G}}$ to be piles of dirt, and the distance between them to be the number of buckets of dirt that need to be moved to transform one to the other, times the sum of the distances each bucket must be moved. Mathematically this is expressed as:

[TABLE]

$\Pi(x,y)$ can be understood to be the optimal “transport plan” explaining how much probability mass is moved from $x$ to $y$ . Another feature of the WGAN is the introduction of a “Lipschitz constraint” which clamps the weights of the discriminator to lie in a fixed interval. The Lipschitz constraint has been found to result in a more reliable gradient signal for the generator and improve training stability. Many other types of GAN objective function have been developed which we do not have room to discuss here. For a review of the major GAN objective functions and latest techniques, see Kurach et al.141

Several papers have emerged so far applying GANs to molecular generation—Guimares et al. (ORGAN),102 Sánchez-Lengling et al. (ORGANIC), 107 De Cao & Kipf (MolGAN),62, 49 and Putin et al. (RANC, ATNC).103, 104 The Objective-Reinforced GAN (ORGAN) of Guimares et al. uses the SMILES molecular representation and an RNN (LSTM) generator and a CNN discriminator.102 The architecture of the ORGAN is taken from the SeqGAN of Yu et al. 142 and uses a WGAN. In ORGAN, the GAN objective function is modified by adding an additional “objective reinforcement” term to the generator RNN’s reward function which biases the RNN to produce molecules with a certain objective property or set of objective properties. Typically the objective function returns a value $R(S)\in[0,1]$ . The reward for a SMILES string $S$ becomes a mixture of two terms:

[TABLE]

where $\lambda\in[0,1]$ is a tunable hyperparameter which sets the mixing between rewards for fooling the discriminator and maximizing the objective function. The proof of concept of the ORGAN was demonstrated by optimizing three quick-to-evaluate metrics which can be computed with RDKit—druglikeliness, synthesizability, and solubility. Proof of concept applications of the ORGAN have been demonstrated in two domains - firstly for drug design it as shown how ORGAN can be used to mazimize Lapinksi’s rule of five metric as well as the quantitative estimate of drug likeliness metric. The second application for which ORGAN was demonstrated is maximizing the power conversion efficiency (PCE) of molecules for use in organic photovoltaics, where PCE is estimated using a machine learning based predictor that was previously developed.107

2.3.1 The perfect discriminator problem and training instabilities

GAN optimization is a saddle point optimization problem, and such problems are known to be inherently unstable. If gradients for one part of the optimization dominate, optimizers can run or “spiral” away from the saddle point so that either the generator or the discriminator achieves a perfect score. The traditional approach to avoiding the perfect discriminator problem, taken by Guimares et al. and others, is to do additional MLE pretraining with the generator to strengthen it and then do $m$ gradient updates for the generator for every one gradient update for the discriminator. In this method, $m$ must be tuned to balance the discriminator and generator training. A different, more dynamic method for balancing the discriminator and generator was invented by Kardurin et al. in their work on the DruGAN AAE.69 They introduce a hyperparameter $0.5<p<1$ which sets the desired “discriminator power”. Then, after each training step, if the discriminator correctly labels samples from the generator with probability less than $p$ , the discriminator is trained, otherwise the generator is trained. Clearly $p$ should be larger than 0.5 since the discriminator should do better than random chance in order to challenge the generator to improve. Empirically, they found $p=3/5$ to be a good value.

Putin et al. show that the ORGANIC model107 with its default hyperparameters suffers from a perfect discriminator problem during training, leading to a plateauing of the generator’s loss.103 To help solve these issues, Putin et al. implemented a differentiable neural computer (DNC) as the generator.103 The DNC (Graves et al, 2016)143 is an extension of the neural Turing machine (Graves et al 2014)144 that contains a differentiable memory cell. A memory cell allows the generator to memorize key SMILES motifs, which results in a much “stronger” generator. They found that the discriminator never achieves a perfect score when training against a DNC. The strength of the DNC is also shown by the fact that it has a higher rate of valid SMILES generation vs. the ORGAN RNN generator (76% vs.

24%) and generates SMILES that are on average twice as long as the SMILES generated by ORGAN. In a subsequent work, Putin et al. also introduced the adversarial threshold neural computer, another architecture with a DNC generator.104

Another issue with GANs is mode collapse, where the generator only generates a narrow range of samples. In the context of molecules, an example might be a generator that only generates molecules with carbon rings and less than 20 atoms.

3 Metrics and reward functions

A key issue in deep generative modeling research is how to quantitatively compare the performance of different generative models. More generally a decline in rigor in the field of deep learning as a whole has been noted by Sculley, Rahimi and others.145 While the recent growth in the number of researchers in the field has obvious benefits, the increased competition that can result from such rapid growth disincentivizes taking time for careful tuning and rigorous evaluation of new methods with previous ones. Published comparisons are often subtly biased towards demonstrating superior performance for technically novel methods vs. older more conventional methods. A study by Lucic et al. for instance found that in the field of generative adversarial networks better hyperparameter tuning and training lead to most recently proposed methods reaching very similar results.139, 146 Similarly, Melis et al. found that with proper hyperparameter tuning a conventional LSTM architecture could beat several more recently proposed methods for natural language modeling.147 At the same time, there is a reproducibility crisis afflicting deep learning—codes published side-by-side with papers often give different results than what was published,141 and in the field of reinforcement learning it has been found that codes which purport to do the same thing will give different results.146 The fields of deep learning and deep generative modeling are still young however, and much work is currently underway on developing new standards and techniques for rigorously comparing different methods. In this section we will discuss several of the recently proposed metrics for comparing generative models and the closely related topic of reward function design for molecular optimization.

3.1 Metrics for scoring generative models

Theis et al. discuss three separate approaches—log-likelihood, estimating the divergence metric between the training data distribution $p(x)$ and the model’s distribution $q(x)$ , and human rating by visual inspection (also called the “visual Turing test”) .148, 149 Interestingly, they show that these three methodologies measure different things, so good performance under one does not imply good performance under another.149

The “inception score” (IS) uses a third-party neural network which has been trained in a supervised manner to do classification.148 In the original IS, Google’s Inception network trained on ImageNet was used as the third-party network. IS computes the divergence between the distribution of classes predicted by the third-party neural network on generated molecules with the distribution of classes predicted for the dataset used to train the neural network. The main weakness of IS is that much information about the quality of images is lost by focusing only on classification labels. The Fréchet Inception Distance (FID) builds off the IS by comparing latent vectors obtained from a late-stage layer of a third-party classification network instead of the predictions.150 Inspired by this, Preuer et al. created the Fréchet ChemNet Distance metric for evaluating models that generate molecules.151 Unfortunately, there is a lack of agreement on how to calculate the FID—some report the score by comparing training data with generated data, while others report the score comparing a hold out test set with the generated data.141 Comparing with test data gives a more useful metric which measures generalization ability, and is advocated as a standard by Kurach et al.141

In the world of machine learning for molecular property prediction, MoleculeNet provides a benchmark to compare the utility of different regression modeling techniques across a wide range of property prediction problems.152 Inspired by MoleculeNet, Polykovskiy and collaborators have introduced the MOlecular SEtS (MOSES) package to make it easier to build and test generative models.153 To compare the output of generative models, they provide functions to compute Fréchet ChemNet Distance, internal diversity, as well as several metrics which are of general importance for pharmaceuticals: molecular weight, logP, synthetic accessibility score, and the quantitative estimation of drug-likeness. In a similar vein, Benhenda et al. have released the DiversityNet benchmark, which was also (as the name suggests) inspired by MoleculeNet.154 Finally, another Python software package called GuacaMol has also been released which contains 5 general purpose benchmarking methods and 20 “optimization specific” benchmarking methods for drug discovery.155 One unique feature of GuacaMol is the ability to compute KL-divergences between the distributions from generated molecules and training molecules for a variety of physio chemical descriptors.

Recently in the context of generative modeling of images with GANs, Im et al. have shown significant pitfalls to using the Inception Distance metric.156 As an alternative, they suggest using the same type of divergence metrics that are used during GAN training. This method has been explored recently to quantify generalization performance of GANS157 and could be of use to the molecular modeling community.

3.2 Reward function design

A good reward function is often important for molecular generation and essential for molecular optimization. The pioneering molecular autoencoder work resulted in molecules which were difficult to synthesize or contained highly labile (reactive or unstable) groups such as enamines, hemiaminals, and enol ethers which would rapidly break apart in the body and thus were not viable drugs.158 Since then, the development of better reward functions has greatly helped to mitigate such issues, but low diversity and novelty remains an issue.159, 160, 161 After reviewing the work that has been done so far on reward function design, we conclude that good reward functions should lead to generated molecules which meet the following desiderata:

Diversity—the set of molecules generated is diverse enough to be interesting. 2. 2.

Novelty—the set of molecules does not simply reproduce molecules in the training set. 3. 3.

Stability—the molecules are stable in the target environment and not highly reactive. 4. 4.

Synthesizability—the molecules can actually be synthesized. 5. 5.

Non-triviality—the molecules are not degenerate or trivial solutions to maximizing the reward function. 6. 6.

Good properties—the molecules have the properties desired for the application at hand.

3.2.1 Diversity & novelty

A diversity metric is a key component of any reward function, especially when using a GAN, where it helps counter the issue of mode collapse to a non-diverse subset. Given a molecular similarity metric between two molecules $T(x_{1},x_{2})\in[0,1]$ the diversity of a generated set $\mathcal{G}$ can be defined quite simply as:

[TABLE]

A popular metric is the Tanimoto similarity between two extended-connectivity fingerprint bit vectors.153 Since the diversity of a single molecule does not make sense, diversity rewards are calculated on mini-batches during mini-batch stochastic gradient descent training. Eqn. 26 is called “internal diversity”. An alternative which compares the diversity of the generated set with the diversity of the training data is the nearest neighbor similarity (SNN) metric:153

[TABLE]

Of course, too much diversity can also be an issue. One option is to use the following negative reward which biases the generator towards matching the diversity of the training data:

[TABLE]

Another diversity measure that has been employed is uniqueness, which aims to reduce the number of repeat molecules. The uniqueness reward $R_{\mbox{{uniqueness}}}\in(0,1]$ is defined as:

[TABLE]

Where $|\mbox{set}(\mathcal{G})|$ is the number of unique molecules in the generated batch $\mathcal{G}$ and $|\mathcal{G}|$ is the total number of molecules.

Novelty is just as important as diversity since a generator which just generates the training dataset over and over is of no practical utility. Guimares et al. define the “soft novelty” for a single molecule as:102

[TABLE]

When measuring the novelty of molecules generated post-training to get an overall novelty measure for the model, it is important to do so on a hold-out test set $\mathcal{T}$ . Then one can look at how many molecules in a set of generated molecules $\mathcal{G}$ appear in $\mathcal{T}$ and use a novelty reward such as:74

[TABLE]

which gives the fraction of generated molecules not appearing in the test set. The diversity of the generated molecules and how they compare to the diversity of the training set can also be visualized by generating fingerprint vectors (which typically have dimensionalities of $d>100$ ) and then projecting them into two dimensions using dimensionality reduction techniques. The resulting 2D point cloud can then be compared with the corresponding points from the training set and/or a hold out test set. There are many possible dimensionality reduction techniques to choose from—Yoshikawa et al.160 use the ISOMAP method,162 Merk et al.80 use multidimensional scaling, and Selger et al.74 use t-SNE projection.163

Interpolation between training molecules may be a useful way to generate molecules post-training which are novel, but not too novel as to be unstable or outside the intended property manifold. In the domain of image generation, GANs seem to excel at interpolation vs. VAEs, for reasons that are not yet fully understood. For instance with GANs trained on natural images, interpolation can be done between a $\boldsymbol{z}$ point corresponding to a frowning man to a point $\boldsymbol{z}^{\prime}$ corresponding to a smiling woman, and all of the intervening points result in images which make sense.164 Empirically most real world high dimensional datasets are found to lie on a low density manifold.165 Ideally, the dimensionality of the latent space $p(\boldsymbol{z})$ used in a GAN, VAE, or AAE will correspond to the dimensionality of this manifold. If the dimensionality of $p(\boldsymbol{z})$ is higher than the intrinsic dimensionality of the data manifold, then interpolation can end up going “off manifold” into so-called “dead zones”. For high dimensional latent spaces with a Gaussian prior, most points will lie on a thin spherical shell. In such cases it has been found empirically that better results can be found by doing spherical linear interpolation or slerp.166 The equation for slerp interpolation between two vectors $\boldsymbol{v}_{1}$ and $\boldsymbol{v}_{2}$ is

[TABLE]

where $\theta=\arccos(\boldsymbol{v}_{1}\cdot\boldsymbol{v}_{2})$ and $0\leq t\leq 1$ is the interpolation parameter.

Another option for generating molecules close to training molecules but not too close is to have a reward for being similar to the training data but not too similar. Olivecrona et al. use a reward function $R_{s}(S)\in[-1,1]$ of the form:73

[TABLE]

here $S$ is the input SMILES and $T$ is the target SMILES, and $\mbox{Sim}(S,T)\in[0,1]$ is similarity scoring function which computes fingerprint-based similarity between the two molecules. $k$ is a tunable cutoff parameter which sets the maximum similarity accepted. This type of reward can be particularly useful for generating focused libraries of molecules very similar to a single target molecule or a small set of “actives” which are known to bind to a receptor. Note that most generative models can be run so as to generate molecules close to a given molecule. For instance, with RNNs, one can do “fragment growing”, which allows molecular designers to explore molecules which share a predefined scaffold.72 Similarly, with reinforcement learning one can simply start the agent with a particular molecule and let it add or remove bonds. Finally, with a VAE one can find the latent representation for a given molecule and then inject a small amount of Gaussian noise to generate “nearby” molecules.89

3.2.2 Stability and synthesizability

Enforcement of synthesizability has thus far mainly been done using the synthetic accessibility (SA) score developed by Ertl & Schuffenhauer, 167 although other synthesizability scoring functions exist.168, 169 The model underlying the SA score was designed and fit specifically to match scores from synthetic chemists on a set of drug molecules, and therefore may be of limited applicability to other types of molecules. When using SA score as a reward in their molecular autoencoder, Gómez-Bombarelli et al. found that it still produced a large number of molecules with unrealistically large carbon rings. Therefore, they added an additional “RingPenalty” term to penalize rings with more than six carbons. In the ORGANIC GAN code, Sánchez-Lengling et al. added several penalty terms to the original SA score function, and also developed an additional reward for chemical symmetry, based on the observation that symmetric molecules are typically easier to synthesize.107

For drug molecules, the use of scoring functions developed to estimate how “drug-like” or “natural” a compound is can also help improve the synthesizability, stability, and usefulness of the generated molecules.102 Examples of such functions include Lipinski’s Rule of Five score,5 the natural product-likeness score,170 the quantitative estimate of drug-likeness,171 and the Muegge metrics.172, 173 Another option of particular utility to drug discovery is to apply medicinal chemistry filters either during training or post-training to tag unstable, toxic, or unsynthesizable molecules. For drug molecules, catalogs of problematic functional groups to avoid have been developed in order to limit the probability of unwanted side-effects.174 For energetic molecules and other niche domains an analogous set of functional groups to avoid has yet to be developed.

Many software packages exist for checking molecule’s stability and syntheszability which may be integrated into training or as a post-training filter. For example Popova et al. use the ChemAxon structure checker software to do a validity check on the generated molecules.78 Bjerrum et al. use the Wiley ChemPlanner software post-training to find synthesis routes for 25-55% of the molecules generated by their RNN.71 Finally, Sumita et al. check for previously discovered synthetic routes for their generated molecules using a SciFinder literature search.79

It is worth mentioning that deep learning is now being used for the prediction of chemical reactions11, 12 and synthesis planning. Segler et al. trained a deep reinforcement learning system on a large set of reactions that have been published in organic chemistry and demonstrated a system that could predict chemical reactions and plan synthetic routes at the level of expert chemists.13

3.2.3 Rewards for good properties

Because they are called often during training, reward functions should be quick to compute, and therefore fast property estimators are called for. Examples of property estimation functions which are fast to evaluate are the estimated octanol-water partition coefficient (LogP), and the quantitative measure of drug-likeness (QED),171 both of which can be found in the open source package RDKit.51

Since physics based prediction is usually very computationally intensive, a popular approach is to train a property predicting machine learning model ahead of time. There is now an enormous literature on deep learning for property prediction demonstrating success in multiple areas.8 Impressive results have been obtained, such as systems which can predict molecular energies at DFT accuracy,175 and highly accurate systems which can transfer between many types of molecules.176 While the literature on quantum property prediction is perhaps the most developed, success has been seen in other areas, such as calculating high level properties of energetic molecular crystals (such as sensitivity and detonation velocity).10, 27 Many predictive models are now published for “off the shelf” use, for instance a collection of predictive models for ADME (absorption, distribution, metabolism, and excretion) called “SwissADME” was recently published.177

It has also been demonstrated that traditional physics-based simulations can be used—Sumita et al. optimize absorption wavelength by converting SMILES strings into 3D structures using RDKit and then calculating their UV-VIS absorption spectra on-the-fly with time-dependent density functional theory (TD-DFT). Instead of the obvious reward function $-\alpha|\lambda^{*}-\lambda|$ , where $\lambda^{*}$ is the target wavelength, they used the following:79

[TABLE]

From the molecules generated by their RNN, Sumita et al. selected six molecules for synthesis and found that 5/6 exhibited the desired absorption profiles.79

A reward function which has been used by several different researchers to generate drug molecules is:

[TABLE]

Yang et al. add an additional penalty for generating invalid SMILES which could be used more broadly:75

[TABLE]

In the context of training the ORGAN architecuture, Guimares et al. found that rotating the reward function metric from epoch to epoch had some advantages to using all metrics at once.102 In other words, in one epoch the rewards may just be for diversity, while in the next they would just be for synthesizability, and so on. This idea could likely be explored further.

4 Prospective and future directions

In this review we have tried to summarize the current state of the art for generative modeling of molecules using deep learning. The current literature is composed of a rich array of representation strategies and model architectures. As in many areas of generative modeling and deep learning, the present day work is largely empirical in nature. As our mathematical understanding of the landscape of generative models improves, so too will our ability to select the best approaches to a particular problem. There are many promising new techniques and architectures from deep generative modeling and optimization more broadly which are ripe to be transferred to the world of molecules. For example, for sequence modeling the Maximum Likelihood Augmented Discrete GAN (MaliGAN) has been shown to be superior to the SeqGAN on which ORGAN is based.178 With RNNs, recently developed attention mechanisms and external memory cells offer a possible avenue to improve SMILES string generation.179

It is worth noting that the latest genetic algorithm based methods can still compete with today’s deep learning based methods. Yoshikawa et al. developed a genetic algorithm which makes edits to SMILES and uses population-based evolutionary selection to generate molecules with high binding affinity as calculated via docking (rDock).38 They compared their method with three other deep-learning based methods (CVAE23, GVAE56, and ChemTS75) for optimizing the “penalized log P score” (eqn. 35). They found that with computer time fixed to eight hours, their method performed better or comparable to the deep learning methods. In a similar vein, Jensen found that a genetic algorithm performed better than a SMILES based RNN+Bayesian optimization, the ChemTS RNN, and a CVAE with 100x lower computational cost.180 It appears that genetic algorithms can explore chemical space in a very efficient manner.

In our discussion of GANs we highlighted an important way in which GANs are superior to maximum likelihood based methods—namely that they can avoid the “filling in” problem which occurs with maximum likelihood where the model’s distribution ends up non-zero where the data distribution is zero. Another point is that the theorems on which the maximum likelihood methodology is based only hold true in the limit of infinite samples.181 In general it appears that GANs can be trained with far fewer samples than maximum likelihood based methods—this can be seen by looking at the $N_{\mbox{{train}}}$ values in table 2. In addition to their benefits, we also touched on several difficulties with GANs—small starting gradient, training instabilities, the perfect discriminator problem, and mode collapse. However, we also cited possible remedies for each of these issues and we expect more remedies to be developed in the future.

There are several major trends we have observed which present day practitioners and those entering the field should be cognizant of:

New representation methods SMILES based techniques are quickly being replaced with techniques that work directly with chemical graphs and three dimensional chemical structures. Directly working with chemical graphs, either by using chemistry-preserving graph operations or tensor representations avoids the problems aropssociated with requiring deep generative models to learn SMILES syntax. At the same time, there is also growing interest in generative models which can generate 3D equilibrium structures, since in many applications the specific 3D positions of atoms can be important.

Better reward functions As mentioned earlier, reward function design is a critical component to molecular generation and optimization. We expect future work will use more sophisticated reward functions which combine multiple objectives into a single reward function. Using multiple reward functions and multi-objective reinforcement learning is also a promising approach.65

Pure reinforcement learning approaches The deep reinforcement learning work of Zhou et al. demonstrated superior molecular optimization performance when compared with the Junction Tree VAE,93 ORGAN,102 and Graph Convolutional Policy Network64 approaches when optimizing the logP and QED metrics.65 The work of Zhou et al. is notable as it is the first to take a pure RL approach with no pretrained generator. We believe much future work in molecular optimization will proceed in this direction since many application areas have limited training data available.

Hierarchical modeling Hierarchical representation schemes will allow for efficient generative modeling of large molecules (such as proteins182, 183) as well as complex systems such as polymers, metal organic frameworks, and molecular crystals. Generative modeling techniques will also be useful not just for optimizing molecules but also optimizing the structures and systems in which they are embedded. GANs have recently been applied to the generation of crystal structures184 and microstructures.185, 186, 187 Hierarchical GANs188 may be useful for the generation of many-molecule complexes or for the simultaneous optimization of both material and geometry in nanomaterials and metamaterials.

Closing the loop After the synthesis and characterization of new lead compounds the data obtained can be fed back to improve the models used, a process called “closing the loop”. More work is needed to develop workflows and methods to interface and integrate generative models into laboratory platforms to allow for rapid feedback and cycling. A key challenge is developing useful software and cyberinfrastructure for computational screening and data management.189 The potential for efficiency improvements via automated AI-assisted synthesis planning and “self-driving” robotic laboratories is quite profound.190, 191, 192, 14, 15

Acknowledgements

Support for this work is gratefully acknowledged from the U.S. Office of Naval Research under grant number N00014-17-1-2108 and from the Energetics Technology Center under project number 2044-001. Partial support is also acknowledged from the Center for Engineering Concepts Development in the Department of Mechanical Engineering at the University of Maryland, College Park. We thank Dr. Ruth M. Doherty, Dr. William Wilson, and Dr. Andrey Gorlin for their input and for proofreading the manuscript.

Bibliography192

The reference list from the paper itself. Each links out to its DOI / PubMed record.

11 Joseph A. Di Masi, Henry G. Grabowski, and Ronald W. Hansen. Innovation in the pharmaceutical industry: New estimates of R&D costs. Journal of Health Economics , 47:20 – 33, 2016.
22 Steven M. Paul, Daniel S. Mytelka, Christopher T. Dunwiddie, Charles C. Persinger, Bernard H. Munos, Stacy R. Lindborg, and Aaron L. Schacht. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nature Reviews Drug Discovery , 9(3):203–214, feb 2010.
33 Axel Homburg. Remarks on the evolution of explosives. Propellants, Explosives, Pyrotechnics , 42(8):851–853, aug 2017.
44 P. G. Polishchuk, T. I. Madzhidov, and A. Varnek. Estimation of the size of drug-like chemical space based on GDB-17 data. Journal of Computer-Aided Molecular Design , 27(8):675–679, Aug 2013.
55 Christopher A. Lipinski, Franco Lombardo, Beryl W. Dominy, and Paul J. Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced Drug Delivery Reviews , 23(1-3):3–25, jan 1997.
66 Ricardo Macarron, Martyn N. Banks, Dejan Bojanic, David J. Burns, Dragan A. Cirovic, Tina Garyantes, Darren V. S. Green, Robert P. Hertzberg, William P. Janzen, Jeff W. Paslay, Ulrich Schopfer, and G. Sitta Sittampalam. Impact of high-throughput screening in biomedical research. Nature Reviews Drug Discovery , 10(3):188–195, mar 2011.
77 Edward O. Pyzer-Knapp, Changwon Suh, Rafael Gómez-Bombarelli, Jorge Aguilera-Iparraguirre, and Alán Aspuru-Guzik. What is high-throughput virtual screening? a perspective from organic materials discovery. Annual Review of Materials Research , 45(1):195–216, jul 2015.
88 Keith T. Butler, Daniel W. Davies, Hugh Cartwright, Olexandr Isayev, and Aron Walsh. Machine learning for molecular and materials science. Nature , 559(7715):547–555, jul 2018.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Contents

1 Molecular representation

1.1 Representations of 3D geometry

1.2 Representations of molecular graphs

1.2.1 SMILES and string-based representations

1.2.2 Image-based representations

1.2.3 Tensor representations

1.2.4 Other graph-based representations

2 Deep learning architectures

2.1 Recurrent neural networks (RNNs)

2.1.1 Optimization with RNNs using reinforcement learning

2.2 Autoencoders

2.2.1 Variational autoencoders (VAEs)

2.2.2 Adversarial autoencoders (AAEs)

2.2.3 Supervised VAEs/AAEs for property prediction & optimization

2.3 Generative adversarial networks (GANs)

2.3.1 The perfect discriminator problem and training instabilities

3 Metrics and reward functions

3.1 Metrics for scoring generative models

3.2 Reward function design

3.2.1 Diversity & novelty

3.2.2 Stability and synthesizability

3.2.3 Rewards for good properties

4 Prospective and future directions

Acknowledgements