Predicting physical properties of alkanes with neural networks

Pavao Santak; Gareth Conduit

arXiv:1908.02067·physics.comp-ph·August 7, 2019

TL;DR

This paper demonstrates that neural networks can accurately predict various physical properties of alkanes, leveraging fragmented data and chemical descriptors, outperforming traditional methods.

Contribution

Introduces neural network models that predict alkane properties using chemical descriptors and property correlations, even with fragmented data, improving prediction accuracy over existing methods.

Findings

01

Neural networks accurately predict boiling point, heat capacity, vapor pressure, and melting point.

02

Property-property correlations enhance prediction quality.

03

Modeling of viscosity and density as functions of temperature and pressure.

Abstract

We train artificial neural networks to predict the physical properties of linear, single branched, and double branched alkanes. These neural networks can be trained from fragmented data, which enables us to use physical property information as inputs and exploit property-property correlations to improve the quality of our predictions. We characterize every alkane uniquely using a set of five chemical descriptors. We establish correlations between branching and the boiling point, heat capacity, and vapor pressure as a function of temperature. We establish how the symmetry affects the melting point and identify erroneous data entries in the flash point of linear alkanes. Finally, we exploit the temperature and pressure dependence of shear viscosity and density in order to model the kinematic viscosity of linear alkanes. The accuracy of the neural network models compares favorably to the…

Tables

Table 1. Summary of the accuracy of three boiling point models. Our neural network model is compared to two regression models that use molecular structure and topological indices as inputs.

Method	$R^{2}$	AAD (°C)
Neural Network	0.995	1.69
Model 7.2 [29]	0.977	2.47
Model 7.3 [29]	0.975	2.24

Table 2. Comparison of the accuracy of the neural network model and a second order group additivity model for the heat capacity.

Method	$R^{2}$	AAD (J (mol K)$^{-1}$)
Neural Network	0.996	2.10
Second Order Group Additivity [32]	0.994	2.87

Table 3. Experimental values and predictions of the flash point for the indicated molecules. The table compares the accuracy of our neural network model with the accuracy of a model based on the group contribution method.

Molecule	Experimental (°C)	Group Contribution Method [31] Prediction (°C)	Neural Network Prediction (°C)	Group Contribution Method [31] Absolute Deviation (°C)	Neural Network Absolute Deviation (°C)
Ethane	-139.16	-129.04	-137.63	10.12	1.53
Propane	-106.49	-97.15	-106.36	9.34	0.13
Butane	-74.00	-71.15	-73.47	2.85	0.53
Pentane	-47.21	-47.15	-46.23	0.06	0.98
Hexane	-17.40	-26.15	-23.02	8.75	5.62
Heptane	-7.12	-6.15	-2.60	0.97	4.52
Octane	16.15	11.85	16.00	4.30	0.15
Nonane	29.29	28.85	33.52	0.44	4.23
Decane	50.45	44.85	50.61	5.60	0.16
Undecane	69.45	60.85	67.89	8.60	1.56
Dodecane	85.43	74.85	83.92	10.58	1.51
Tridecane	100.32	89.85	98.21	10.47	2.11
Tetradecane	111.30	102.85	111.24	8.45	0.06
Pentadecane	122.55	115.85	123.34	6.70	0.79
Hexadecane	131.67	128.85	134.65	2.82	2.98
Heptadecane	146.83	141.18	145.41	5.65	1.42
Octadecane	156.28	153.14	155.81	3.14	0.47
Nonadecane	167.25	164.79	165.80	2.46	1.45
Icosane	175.96	176.13	175.34	0.17	0.62
Octacosane	226.81	258.19	228.83	31.38	2.01
Triacontane	239.93	276.80	238.23	36.87	1.70

Table 4. Experimental values and predictions of the melting point for the indicated molecules. The table compares the accuracy of our neural network model with the accuracy of regression models based on the topological indices and molecular structure.

Molecule	Experimental (°C)	Topological Indices [34] Prediction (°C)	Neural Network Prediction (°C)	Topological Indices [34] Absolute Deviation (°C)	Neural Network Absolute Deviation (°C)
4-methylnonane	-98.70	-95.15	-95.66	3.55	3.04
Dodecane	-9.58	-11.25	-10.44	1.67	0.86
2-methylundecane	-46.81	-47.85	-46.17	1.04	0.64
3-methylundecane	-58.00	-65.55	-56.74	7.55	1.26

Table 5. Experimental values and predictions of the kinematic viscosity at 20 °C and atmospheric pressure for the indicated molecules. The table compares the accuracy of our neural network model with the accuracy of a model based on the free volume theory.

Molecule	Experimental (cSt)	Free Volume Theory Model [34] Prediction (cSt)	Neural Network Prediction (cSt)	Free Volume Theory Model [34] Absolute Deviation (cSt)	Neural Network Absolute Deviation (cSt)
Octane	0.65	0.72	0.77	0.07	0.12
Nonane	1.00	0.98	0.99	0.02	0.01
Decane	1.31	1.25	1.27	0.07	0.05
Undecane	1.56	1.77	1.60	0.21	0.05
Dodecane	1.96	1.97	2.01	0.01	0.05
Tridecane	2.48	3.12	2.49	0.64	0.01
Tetradecane	2.99	3.01	3.06	0.01	0.07
Pentadecane	3.78	5.50	3.74	1.73	0.04
Hexadecane	4.54	4.56	4.51	0.02	0.03

Table 6. Summary of results for all the physical properties analysed.

Physical Property	$N_{molecules}$	$R^{2}$	AAD
$T_{boil}$	188	0.992	1.74 °C
$C_{molar}$	181	0.997	2.33 J (mol K)$^{-1}$
$T_{flash}$	21	0.999	1.61 °C
$T_{melting}$	51	0.998	1.26 °C
$p_{vapor}$	51	0.917	0.07 bar
$\nu$	9	0.998	0.05 cSt

Equations

x^{2}_{i}=\sigma\bigg(\sum_{j}w^{1}_{ij}x^{1}_{j}+w^{1}_{0i}\bigg).

x^{3}_{i}=\sum_{j}w^{2}_{ij}x^{2}_{j}+w^{2}_{0i}.

\mathrm{Cost}(W)=\frac{1}{N}\sum_{i,j}\big(y^{[i]}_{j}-x^{3[i]}_{j}\big)^{2}.

x^{[n+1]}=\gamma x^{[n]}+(1-\gamma)f(x^{[n]}),

\mathrm{Cost}(W)_{k}=\frac{1}{N}\sum_{i,j}q_{k,i}\big(y^{[i]}_{j}-x^{3[i]}_{j}\big)^{2},

\log_{10}p=A-\frac{B}{C+T}.

\gamma=\frac{1}{\Delta T}\int_{T_{\rm min}}^{T_{\rm max}}\left|p_{\rm exp}(T)-p_{\rm model}(T)\right|dT,

(T_{\rm min},T_{\rm max})=\frac{B_{\rm exp}}{A_{\rm exp}-(-1.875,\,0.294)}-C_{\rm exp}

\delta^{2}=\int_{T_{\rm min}}^{T_{\rm max}}\big(p_{\rm exp}(T)-p_{\rm model}(T)\big)^{2}dT

\sigma^{2}=\int_{T_{\rm min}}^{T_{\rm max}}\big(p_{\rm exp}(T)-\overline{p}_{\rm exp}\big)^{2}dT,

\overline{p}_{\rm exp}=\frac{1}{\Delta T}\int_{T_{\rm min}}^{T_{\rm max}}p_{\rm exp}(T)\,dT,


Full text

Predicting physical properties of alkanes with neural networks

Pavao Santak

Gareth Conduit

Theory of Condensed Matter, Department of Physics, University of Cambridge, J.J.Thomson Avenue, Cambridge, CB3 0HE, United Kingdom

Abstract

We train artificial neural networks to predict the physical properties of linear, single branched, and double branched alkanes. These neural networks can be trained from fragmented data, which enables us to use physical property information as inputs and exploit property-property correlations to improve the quality of our predictions. We characterize every alkane uniquely using a set of five chemical descriptors. We establish correlations between branching and the boiling point, heat capacity, and vapor pressure as a function of temperature. We establish how the symmetry affects the melting point and identify erroneous data entries in the flash point of linear alkanes. Finally, we exploit the temperature and pressure dependence of shear viscosity and density in order to model the kinematic viscosity of linear alkanes. The accuracy of the neural network models compares favorably to the accuracy of several physico-chemical/thermodynamic methods.

keywords:

Fragmented data, Neural network, Lubricant, Alkane, Flash point

1 Background

Lubricants are an important component in modern industry. They are used to reduce friction between surfaces, protect them from wear, transfer heat, remove dirt, and prevent surface corrosion to ensure the smooth functioning of mechanical devices. The demand for lubricants makes them an important economic component of the oil and gas business, and their importance is only expected to grow. Even as we move towards a future in which fossil fuels will be a less significant source of energy, the lubricant market is expected to grow (see https://www.grandviewresearch.com/press-release/global-lubricants-market).

A typical lubricant product consists mainly of base oil, which is a mixture of predominantly alkanes that typically have between 18 and 50 carbon atoms. To improve the performance of a base oil, various additives are introduced. Seven physical properties of prime importance for lubricant performance are: melting point, boiling point, flash point, heat capacity, vapor pressure, dynamic viscosity and density. Most individual alkanes with appropriate properties have never been isolated, so relatively little is quantitatively known about their performance. However, experimentally determined values for some alkanes are available in the TRC Thermodynamic Tables: Hydrocarbons volumes [1] or in the DIPPR 801 database (https://www.aiche.org/dippr/events-products/801-database).

Lubricants are made from readily available mixtures of predominantly alkanes so it’s not certain that current formulations are optimal. Predicting the physical properties of alkanes and understanding the link between alkane structure and lubricant performance would enable the computational design of an optimal base oil, which would motivate the distillation of base oil constituents to approach this optimum in practice.

The physical properties of alkanes that are relevant for base oil lubricant design have previously been modeled with a variety of semi-empirical methods. Wei explored the relationship between rotational entropy and the melting point [22], while Burch and Whitehead use a combination of molecular structure and topological indices to model the melting point of single branched alkanes with fewer than 20 carbon atoms [33]. To predict the normal boiling point of alkanes, Messerly et al. merged an infinite chain approximation and an empirical equation [17], while Burch, Wakefield, and Whitehead [29] used topological indices and molecular structure to model it for alkanes with fewer than 13 carbon atoms and Constantinou and Gani [30] developed a novel group contribution method to calculate it for various organic compounds. The semi-empirical Antoine equation is frequently used to model the vapor pressure as a function of temperature. Mathieu developed a group contribution based method to calculate the flash point of various alkanes [31], while Ruzicka and Domalski estimated the heat capacity of various liquid alkanes using a second order group additivity method [32]. De La Porte and Kossack have developed a model based on free volume theory to study long chain linear alkane viscosity as a function of temperature and pressure [34], Riesco and Vesovic have expanded a hard sphere model to study similar systems [35], and Novak has established a corresponding-states model to study viscosity of linear alkanes for the entire fluid region [36].

Purely empirical approaches have also been used to predict physical properties of alkanes. For example, Marano et al. developed an empirical set of asymptotic behavior correlations to predict the physical properties of a limited family of alkanes and alkenes [13],[14],[15]. Alqaheem and Riazi, and Needham et al. have explored correlations between different properties [2],[19] to predict missing values.

While all of these approaches have their own merits, they cannot address the full range of alkanes, as they have a limited range of validity. To accurately predict physical properties for a wide range of alkanes we propose to exploit property-property correlations, molecular structure-property correlations, and semi-empirical equations. Unfortunately, the data set of physical properties of alkanes is fragmented, so to learn the property-property correlations, we need a statistical method that can impute the missing values. One such method is principal component analysis (PCA) [24], but it delivers accurate results only when the variables of interest are linearly correlated. Gaussian processes [25] are another common approach to handling fragmented data, but they are prohibitively expensive on large datasets and frequently predict large uncertainties for data that is vastly dissimilar to the training data, which limits their extrapolative power.

There is another statistical tool that we could use to predict physical properties of alkanes: artificial neural networks (ANNs) [5],[21] (Figure 2). ANNs have undergone rapid development in the last few years, finding applications from image recognition to digital marketing. They have also successfully been used to model physical properties of various organic compounds. For example, Suzuki, Ebert and Schüürmann used physical properties and indicator variables for functional groups to model viscosity as a function of temperature for 440 organic liquids [37]. Ali implemented a conceptually similar approach to model vapor pressure as a function of temperature for various organic compounds [38]. Hosseini, Pierantozzi and Moghadasi, on the other hand, used pressure, pseudo-critical density, temperature and molecular weight as neural network inputs to model the dynamic viscosity of several fatty acids and biodiesel fuels as a function of temperature [39].

Unfortunately, while they are a powerful statistical tool, the artificial neural networks previously used to model physico-chemical and thermodynamic properties of organic compounds are not able to handle fragmented data, which limits their applicability to modelling physical properties of alkanes. However, the neural networks described in Refs. [6], [26], [27], [40], [41] can be trained and run with fragmented data, which enables us to exploit property-property correlations even when data is fragmented. This novel neural network formalism has been used to discover two nickel-based alloys for jet engines [27], and two molybdenum alloys for forging hammers [26], as well as for imputing and finding errors in databases, with over a hundred errors discovered in commercial alloy and polymer databases [6]. It has also been applied to the imputation of assay bioactivity data [40]. These ANNs serve as a holistic prediction tool for the physical properties of alkanes, enabling us to exploit the property-property correlations, impute the missing values, and exploit the correlations between molecular structure and physical properties.

In section 2, we present the theory of these neural networks, describe an algorithm to generate the molecular basis, and outline a statistical scheme to identify the most accurate neural network model. In section 3, we apply this formalism to predict the physical properties of linear and branched alkanes: in subsection 3.1, we predict the boiling point and the heat capacity of light branched alkanes; in subsection 3.2, we predict the vapor pressure of light branched alkanes as a function of temperature; in subsection 3.3, we predict the flash point of linear alkanes and identify erroneous experimental entries; in subsection 3.4, we predict the melting point of light branched alkanes and explore physical effects of symmetry; and in subsection 3.5, we predict the kinematic viscosity of linear alkanes by exploiting the temperature and pressure dependence of their dynamic viscosity and density. Finally, we summarize our findings in section 4. We compare the accuracy of neural network models to competing physico-chemical/thermodynamic methods that have been used to model the same properties on similar systems. We determine the accuracy of our models through the coefficient of determination ($\textit{R}^{2}$) and the average absolute deviation (AAD). We use $\textit{R}^{2}$ because it is invariant under shifts and rescaling of the data, which is a very useful property for problems in which neural networks are used, and AAD because of its simplicity and interpretability.
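For reference, the two accuracy metrics are assumed here to take their standard forms; the symbols $y_{i}$, $\hat{y}_{i}$, $\bar{y}$ and $N$ are introduced for this note only. With experimental values $y_{i}$, predictions $\hat{y}_{i}$, and mean experimental value $\bar{y}$ over $N$ entries,

$$R^{2}=1-\frac{\sum_{i}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i}(y_{i}-\bar{y})^{2}},\qquad \mathrm{AAD}=\frac{1}{N}\sum_{i}\left|y_{i}-\hat{y}_{i}\right|.$$

Replacing every $y_{i}$ and $\hat{y}_{i}$ by $ay_{i}+b$ and $a\hat{y}_{i}+b$ (with $a\neq 0$) multiplies both the numerator and the denominator of the ratio by $a^{2}$, so $R^{2}$ is unchanged, whereas AAD scales by $|a|$ and keeps the units of the property.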

2 Theory

2.1 Molecular basis

The correlation between molecular structure and physical properties is the backbone of modeling physical properties of alkanes. To exploit these correlations we define a molecular basis that uniquely encodes the structure of every linear, single branched, and double branched alkane into five nonnegative integers. After representing each alkane as a two dimensional graph (Figure 1), these five basis set parameters are:

1. The number of carbon atoms.

2. The smaller number of C-C bonds between the end of the longest carbon chain and its closer branch.

3. The number of C-C bonds in the branch closer to an end of the longest carbon chain.

4. The number of C-C bonds between the other end of the longest carbon chain and its closer branch.

5. The number of C-C bonds in the second branch.

If an alkane has a single branch, the last two basis elements are 0. If an alkane is linear, only the first element is nonzero. This allows the basis set to smoothly pass from straight chain to single to double branched alkane.
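The five-element basis can be illustrated with a short sketch. The function name `encode_alkane`, its input convention (longest-chain length plus a list of locant and branch-size pairs), and the choice to count the bond that attaches a branch to the chain as part of the branch (so a methyl branch counts as 1) are assumptions made for illustration; the text does not specify them.

```python
def encode_alkane(chain_length, branches=()):
    """Return a five-integer descriptor for a linear, single- or double-branched alkane.

    chain_length: number of carbon atoms in the longest chain.
    branches: iterable of (position, branch_carbons), where position is the
              1-based locant on the longest chain (hypothetical input format).
    Assumption: a branch with k carbons contributes k C-C bonds,
    counting the bond that attaches it to the chain.
    """
    branches = sorted(branches)
    if len(branches) > 2:
        raise ValueError("only linear, single and double branched alkanes are covered")
    for pos, size in branches:
        if not 1 < pos < chain_length or size < 1:
            raise ValueError("a branch must sit on an interior carbon and be non-empty")

    n_carbons = chain_length + sum(size for _, size in branches)
    if not branches:
        return (n_carbons, 0, 0, 0, 0)

    # Orient the chain so that numbering starts at the end nearest to a branch.
    bonds_from_start = branches[0][0] - 1
    bonds_from_end = chain_length - branches[-1][0]
    if bonds_from_end < bonds_from_start:
        branches = sorted((chain_length + 1 - pos, size) for pos, size in branches)

    first_pos, first_size = branches[0]
    d1 = first_pos - 1  # bonds between the near end and the first branch
    if len(branches) == 1:
        return (n_carbons, d1, first_size, 0, 0)

    second_pos, second_size = branches[1]
    d2 = chain_length - second_pos  # bonds between the other end and the second branch
    return (n_carbons, d1, first_size, d2, second_size)


heptane = encode_alkane(7)
methylbutane = encode_alkane(4, [(2, 1)])
ethyl_methyl_heptane = encode_alkane(7, [(2, 1), (3, 2)])
```

Under these assumptions heptane maps to `(7, 0, 0, 0, 0)` and 2-methylbutane to `(5, 1, 1, 0, 0)`, so the descriptor passes from linear to branched alkanes by switching on successive elements, as described above.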

2.2 Neural networks

Neural networks are a versatile modern statistical tool. They are a universal function approximator [11] that can recognize patterns that other statistical methods miss. In this section, we describe the theory of the neural networks that we use to predict the physical properties of alkanes. We first present a standard neural network in Figure 2, and then we describe the modifications that enable us to handle fragmented data.

The standard building block of a neural network is called a node. Each node represents a variable. Nodes are arranged in three types of layers. Every node is denoted by $x^{i}_{j}$, where $x$ is the variable, $i$ is the layer index, and $j$ is the node index. The first layer is called the input layer and comprises the descriptor variables. The second layer is called the hidden layer, and its elements are nonlinear functions of linear combinations of the input nodes,

$$x^{2}_{i}=\sigma\bigg(\sum_{j}w^{1}_{ij}x^{1}_{j}+w^{1}_{0i}\bigg).$$

In the above equation, $w^{1}_{ij}$ are called the weights and $\sigma$ is a nonlinear function, commonly known as a transfer function. Our neural networks use $\sigma(x)=\tanh(x)$ as the transfer function. The third layer is the output layer; its elements are linear combinations of the nodes in the hidden layer and they represent the estimators for the variables of interest,

$$x^{3}_{i}=\sum_{j}w^{2}_{ij}x^{2}_{j}+w^{2}_{0i}.$$

We train the neural networks by minimizing the cost function

$$\mathrm{Cost}(W)=\frac{1}{N}\sum_{i,j}\big(y^{[i]}_{j}-x^{3[i]}_{j}\big)^{2}.$$

In the above equation, $[i]$ denotes the $i^{\rm{th}}$ example in the training data while ${j}$ denotes the $j^{\rm{th}}$ variable, $y$ denotes the training example, $x^{3}$ denotes the prediction, and $N$ denotes the number of training examples. This form of cost function is called the mean squared error cost function. There are several other cost functions, such as the mean absolute error cost function, the cross-entropy cost function, or the root mean square error cost function [21]. Minimizing the root mean square error cost function is equivalent to minimizing the mean squared error cost function, the mean absolute error doesn’t have a unique minimum and is used to promote sparsity of the weight matrix, while the cross-entropy cost function is used in classification problems. Modelling physical properties of alkanes is a non-sparse regression problem, so the mean squared error cost function is an appropriate one to use. The cost function is minimized iteratively by varying the weight matrix $W$ to yield the best model for the training data. To minimize the cost function, we first normalise the data and then perform a random walk in the weight matrix space until convergence. Other commonly used algorithms include gradient descent and stochastic gradient descent [21], but we have found that the random walk algorithm with a predetermined expected move acceptance probability of 20% is as accurate as, and faster than, these alternatives.
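As a rough illustration of this section, the sketch below implements a single-output network with a tanh hidden layer, the mean squared error cost, and a random-walk minimiser. The text does not give the details of the random walk, so the acceptance rule (keep a move only if it lowers the cost) and the step-size adaptation that steers the acceptance rate towards 20% are assumptions, as are all function names and the flat weight layout.

```python
import math
import random


def forward(weights, x, n_hidden):
    """Single-output network: tanh hidden layer followed by a linear output."""
    n_in = len(x)
    hidden = []
    k = 0
    for _ in range(n_hidden):
        # weights[k : k + n_in] multiply the inputs; weights[k + n_in] is the bias.
        z = weights[k + n_in] + sum(weights[k + j] * x[j] for j in range(n_in))
        hidden.append(math.tanh(z))
        k += n_in + 1
    # Remaining entries: n_hidden output weights followed by the output bias.
    return weights[k + n_hidden] + sum(weights[k + i] * hidden[i] for i in range(n_hidden))


def cost(weights, xs, ys, n_hidden):
    """Mean squared error over the training examples."""
    return sum((y - forward(weights, x, n_hidden)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def random_walk_fit(xs, ys, n_hidden, steps=5000, target_accept=0.2, seed=0):
    """Greedy random walk in weight space (an assumed variant, not the authors' code)."""
    rng = random.Random(seed)
    n_weights = n_hidden * (len(xs[0]) + 1) + n_hidden + 1
    weights = [rng.gauss(0.0, 0.5) for _ in range(n_weights)]
    best = cost(weights, xs, ys, n_hidden)
    step = 0.1
    # Growing on acceptance and shrinking on rejection with these factors
    # balances out when the acceptance rate equals target_accept.
    grow = math.exp(0.05 * (1.0 - target_accept))
    shrink = math.exp(-0.05 * target_accept)
    for _ in range(steps):
        trial = [w + rng.gauss(0.0, step) for w in weights]
        trial_cost = cost(trial, xs, ys, n_hidden)
        if trial_cost < best:
            weights, best = trial, trial_cost
            step *= grow
        else:
            step *= shrink
    return weights, best


# Toy, already normalised data (made up for illustration): y = x^2.
toy_x = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
toy_y = [1.0, 0.25, 0.0, 0.25, 1.0]
fitted_weights, fitted_cost = random_walk_fit(toy_x, toy_y, n_hidden=2, steps=300)
```

Because only improving moves are accepted, the returned cost can never exceed the cost of the initial random weights; a gradient-based optimiser could be swapped in without changing `forward` or `cost`.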

2.3 Handling sparse data

The neural networks described above can exploit the correlations between molecular structure and physical properties, but they cannot train from sparse data, as they require all the inputs to give an output. To exploit the property-property correlations we use physical properties as both inputs and outputs of the neural network model, which requires two changes to its architecture. Firstly, during neural network training, the weights $w^{2}_{ii}$ are set to zero to ensure property predictions are independent of the original value. Secondly, after the network is trained, we replace the missing values with the mean property values and recursively apply the following equation:

$$x^{[n+1]}=\gamma x^{[n]}+(1-\gamma)f(x^{[n]}),$$

where $n$ denotes the iteration step, $f(x)$ is a prediction for $x$ obtained from the neural network, and $\gamma\in[0,1]$ is a mixing parameter. In this manuscript, we use $\gamma=\frac{1}{2}$ . We apply the above equation until convergence, before we apply function $f$ again. A schematic of the data imputation algorithm is shown in Figure 3. To predict the mean and the uncertainty in the physical property of interest, we train and run six neural networks in parallel, assigning random weights to each data entry for each neural network model. For each neural network $k$ , the cost function then takes the following form:

$$\mathrm{Cost}(W)_{k}=\frac{1}{N}\sum_{i,j}q_{k,i}\big(y^{[i]}_{j}-x^{3[i]}_{j}\big)^{2},$$

where $\sum_{i}q_{k,i}=1$.
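The imputation recursion of this section can be sketched as follows. The function `impute`, the use of `None` to mark missing entries, the convergence tolerance, and the decision to update only the missing entries while leaving measured values fixed are assumptions for illustration; `toy_predict` is a stand-in for a trained network.

```python
def impute(record, predict, means, gamma=0.5, tol=1e-8, max_iter=1000):
    """Fill missing entries (None) by iterating x <- gamma*x + (1 - gamma)*f(x).

    record:  list of property values, with None marking a missing value.
    predict: callable returning a prediction for every entry of x
             (stands in for the trained network f).
    means:   mean property values used as the starting guess.
    """
    missing = [i for i, value in enumerate(record) if value is None]
    x = [means[i] if value is None else value for i, value in enumerate(record)]
    for _ in range(max_iter):
        fx = predict(x)
        new_x = list(x)
        for i in missing:
            new_x[i] = gamma * x[i] + (1.0 - gamma) * fx[i]
        change = max((abs(a - b) for a, b in zip(new_x, x)), default=0.0)
        x = new_x
        if change < tol:
            break
    return x


def toy_predict(x):
    """Toy relation (made up): the second property is twice the first."""
    return [x[0], 2.0 * x[0]]


filled = impute([3.0, None], toy_predict, means=[0.0, 10.0])
```

With $\gamma=\frac{1}{2}$ the distance between the missing entry and the fixed point of the toy relation halves at every step, so `filled[1]` approaches 6.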

2.4 Cross validation

Training error, which we measure through the coefficient of determination ($\textit{R}^{2}$) and the average absolute deviation (AAD), is a poor indicator of a neural network’s predictive power, as it underestimates the true error in neural network models. To obtain a better estimate of the neural network model accuracy, we perform cross-validation by splitting the full data set into a training set and a validation set. We use a scheme called leave-one-out cross-validation, in which we train the neural network on all but one data entry in a dataset before we test it against the remaining entry. We repeat this process until the neural network has been tested against every entry in the dataset.

We also use leave-one-out cross-validation to determine the optimal number of hidden nodes for our neural network. We train neural networks with different numbers of hidden nodes, perform cross-validation for each of them, and choose the architecture that has the smallest cross-validation error. We illustrate this procedure in Figure 4 by determining the optimal neural network architecture for predicting the boiling point of straight, single-branched and double-branched alkanes. In this case, the training $\textit{R}^{2}$ increases with the number of hidden nodes, but the cross-validation error is smallest for the neural network model with 6 hidden nodes. A network with too few hidden nodes is unable to properly capture the behavior, while one with too many hidden nodes overfits the training data.
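The cross-validation scheme can be sketched generically as below. The helper names are hypothetical, and the two candidate models (a constant mean and a least-squares line) are simple stand-ins for networks with different numbers of hidden nodes; the data are made up.

```python
def r_squared(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def aad(ys, preds):
    return sum(abs(y - p) for y, p in zip(ys, preds)) / len(ys)


def leave_one_out(xs, ys, fit, predict):
    """Train on all entries but one, predict the held-out entry, repeat for every entry."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(predict(model, xs[i]))
    return preds


def select_model(xs, ys, candidates):
    """Return the name of the candidate with the smallest cross-validation AAD."""
    scores = {name: aad(ys, leave_one_out(xs, ys, fit, predict))
              for name, (fit, predict) in candidates.items()}
    return min(scores, key=scores.get), scores


# Stand-in models (assumptions): a constant mean and a straight line.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


data_x = [0.0, 1.0, 2.0, 3.0, 4.0]
data_y = [1.0, 3.0, 5.0, 7.0, 9.0]  # made-up data on the line y = 2x + 1
best_name, cv_scores = select_model(
    data_x, data_y,
    {"mean": (fit_mean, predict_mean), "line": (fit_line, predict_line)},
)
```

In the setting of the text, each candidate would be a network with a given number of hidden nodes, and the architecture with the lowest cross-validation error would be kept.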

3 Results and discussion

In this section, we apply the formalism presented in section 2 to predict the physical properties of alkanes. We first predict the boiling point and the heat capacity of branched alkanes. Then we predict the Antoine coefficients to model the vapor pressure as a function of temperature. We also identify erroneous data in the flash point data from the literature, predict the flash point and establish the connection between the number of molecular symmetries and the melting point. Finally, we exploit the temperature and pressure dependence of dynamic viscosity and density and predict the kinematic viscosity of linear alkanes as a function of temperature.

We work with linear alkanes up to tridecane and with branched alkanes with fewer than 13 carbon atoms. Our data set comprises experimental values obtained from various online sources ([9], [10], [18]), as well as experimental values presented in previous research papers ([3], [4], [7], [8], [12], [16], [20], [23]) and the TRC Thermodynamic Tables [1].

3.1 Boiling point and heat capacity

Predicting the boiling point of alkanes is an important step in determining their suitability for use in base oils, as alkanes with higher boiling points stay liquid at higher temperatures. We predict the normal boiling point of branched alkanes with fewer than 13 carbon atoms by training a neural network on a dataset of 188 alkanes [1], with the molecular basis as the input nodes and 6 hidden nodes (subsection 2.4), obtaining a cross-validation $\textit{R}^{2}=0.992$ and an AAD of 1.74 °C, indicating an excellent fit. The high quality of the fit to experimental data can be seen in Figure 7.

After establishing the accuracy of the neural network, we compare our results to two regression models that use molecular structure and topological indices as inputs [29]. We compare the three models on the 62 alkanes whose boiling point all three models predict. Our neural network model outperforms both alternative models (Table 1). Apart from showing improved accuracy, our neural network model shows greater consistency than the two competing models: the standard deviation in absolute error is 1.43 °C for the neural network, compared to 4.37 °C for model 7.2 and 4.75 °C for model 7.3. A parity plot is shown in Figure 5. There are several molecules that both models 7.2 and 7.3 mispredict by a significant margin. The absolute deviations for the boiling points of 3-ethyl-2-methyl-heptane, 3-ethyl-3-methyl-pentane and 3-ethyl-3-methyl-heptane are 19.4 °C, 19.5 °C and 24.3 °C for model 7.2 and similar for model 7.3, while they are 0.44 °C, 0.48 °C and 3.03 °C for the neural network model.

Focusing only on alkanes with five or more carbon atoms, we observe that the average absolute deviation for structural isomers decreases with increasing molecular weight (Figure 6). A decreasing average absolute deviation, together with the greater accuracy and consistency of our predictions compared to other models, means that our neural network model can be used with higher confidence to predict the boiling point of alkanes whose boiling point hasn’t yet been experimentally measured.

We observe that adding a branch but keeping the molecular weight constant decreases the boiling point by about 7 °C. Increasing the length of the branch while keeping molecular weight constant reduces the boiling point by about 2 °C, while moving the branch by an atom along the longest chain reduces it by about 2 °C.

We also predict the molar heat capacity of branched alkanes with fewer than 13 carbon atoms at 25 °C. The larger the molar heat capacity, the more energy an alkane can absorb and transport without a change in temperature, making it more suitable for use in lubricant base oils.

After applying the same neural network architecture that we used to predict the boiling point to a dataset of 176 alkanes, we obtain a cross-validation $\textit{R}^{2}=0.997$, showing an excellent fit. Our dataset doesn’t include methane, ethane, propane, butane and 2-methylbutane, as they are not liquids at 25 °C. We can see the quality of fit for some of our predictions in Figure 10.

We compare the quality of predictions from the neural network model to those from a model based on second order group additivity [32]. The neural network model outperforms the second order group additivity method, giving an AAD of 2.10 $\mathrm{J\,(mol\,K)^{-1}}$ (Table 2). Our model also exhibits greater consistency than the second order group additivity method. The standard deviation in the absolute deviation of our neural network model is 2.04 $\mathrm{J\,(mol\,K)^{-1}}$, compared to 2.87 $\mathrm{J\,(mol\,K)^{-1}}$ for the second order group additivity method. We show a parity plot for both models in Figure 8.

We also investigate the accuracy of our models as a function of the number of carbon atoms for all the alkanes with more than 5 and fewer than 13 carbon atoms (Figure 9). Unlike for the boiling point, we do not observe a decrease of the average absolute deviation with increasing molecular weight. While the average absolute deviation is the smallest for the structural isomers of dodecane, it is the largest for the isomers of nonane. Nonetheless, the increased accuracy and consistency of our model result in higher confidence in using neural networks to predict the molar heat capacity of alkanes whose heat capacity is unknown. Our results indicate that the molar heat capacity is approximately an increasing linear function of the number of carbon atoms, while the effects of adding a branch, increasing its length or moving it along the longest carbon chain are negligible.

3.2 Vapor pressure

Vapor pressure is an important indicator of an alkane’s volatility, since a lower vapor pressure means that an alkane has a higher boiling point at fixed external pressure and so will be more stable in an engine. To model vapor pressure as a function of temperature, scientists first record the vapor pressure at various temperatures and then fit it to the Antoine equation to determine its coefficients:

$$\log_{10}p=A-\frac{B}{C+T}.$$

Coefficients $A$ and $B$ arise from the solution to the Clausius-Clapeyron relation in an ideal gas approximation, while coefficient $C$ is empirical and captures the temperature dependence of latent heat. Temperature $T$ is measured in °C. Experimentally deduced values for $A$, $B$ and $C$ (51, 72, 72 data entries coming from [1]) in our database give an accurate description of the vapor pressure’s temperature profile between the temperatures at which $\log_{10}p=-1.875$ and $\log_{10}p=0.294$, with pressure measured in bar.

To determine the coefficients $B$ and $C$, we train a neural network with 6 hidden nodes for each coefficient. Then, we use the results for $B$, $C$ and the boiling point at atmospheric pressure (when $\log_{10}p=0$) in order to calculate $A$. We use the molecular basis as our input nodes and obtain a cross-validation $\textit{R}^{2}=0.974$ for $B$, $\textit{R}^{2}=0.962$ for $C$, and $\textit{R}^{2}=0.958$ for $A$. We also obtain an AAD of 0.008 bar for $A$, 19.81 bar °C for $B$ and 1.31 °C for $C$. Adding a branch decreases $B$ by about 36 bar °C (Figure 11), adding a branch and keeping molecular weight constant increases $C$ by 3.5 °C, while the effect of extending a branch or moving it along the longest chain is negligible. The Antoine $A$ coefficient is approximately constant for all the alkanes, which is consistent with the Clausius-Clapeyron equation, in which $A$ arises as an integration constant. We observe that adding a branch while keeping the molecular weight constant increases the vapor pressure, while extending the branch at constant molecular weight, or moving it along the longest carbon chain, increases the vapor pressure further, but by a smaller amount than adding a branch.
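The step that recovers $A$ from $B$, $C$ and the boiling point follows from setting $\log_{10}p=0$ in the Antoine equation, which gives $A=B/(C+T_{\rm boil})$. A minimal sketch, with hypothetical function names and made-up coefficient values of a plausible magnitude (they are not taken from the paper's data):

```python
def antoine_log10_p(T, A, B, C):
    """Antoine equation: log10 of the vapor pressure (bar) at temperature T (deg C)."""
    return A - B / (C + T)


def antoine_A_from_boiling_point(B, C, T_boil):
    """Setting log10 p = 0 at the boiling point gives A = B / (C + T_boil)."""
    return B / (C + T_boil)


# Illustrative values only (assumed).
B_pred, C_pred, T_boil = 1170.0, 224.0, 68.7
A_pred = antoine_A_from_boiling_point(B_pred, C_pred, T_boil)
p_at_boil = 10 ** antoine_log10_p(T_boil, A_pred, B_pred, C_pred)
```

By construction the modelled pressure at `T_boil` equals 1 bar, so any error in the predicted boiling point shifts the whole pressure curve through $A$.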

We use neural network predictions for Antoine coefficients to calculate the vapor pressure as a function of temperature and compare to experimental results. Since vapor pressure is a continuous variable, we use two replacement metrics instead of AAD and $\textit{R}^{2}$ to determine the accuracy of our model. To calculate the first metric, we first calculate the following quantity:

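$$\gamma=\frac{1}{\Delta T}\int_{T_{\rm min}}^{T_{\rm max}}\left|p_{\rm exp}(T)-p_{\rm model}(T)\right|dT,$$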

before we average it over all the molecules. Note that $\gamma$ is the average absolute deviation of the vapor pressure over the considered temperature range, while $T_{\rm min}$ and $T_{\rm max}$ are calculated via the following relation:

[TABLE]

Instead of the coefficient of determination, for each molecule we first calculate the following two quantities:

[TABLE]

and

[TABLE]

where

[TABLE]

before we calculate the substitute metric as $1-(\sum_{i}\delta_{i}^{2})/(\sum_{i}\sigma_{i}^{2})$ . This metric tells us how much better our model is compared to a model in which we use an average value of the vapor pressure to model it over an entire temperature range. The value of the former metric is 0.069 bar with a standard deviation of 0.085 bar, while the value of the latter metric is 0.917, indicating a good fit.
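
The two metrics can be illustrated with a short sketch. The defining equations are not reproduced in this text, so the code below rests on assumptions: $\gamma$ is taken as the temperature-averaged absolute deviation between predicted and experimental vapor pressure, $\delta_{i}^{2}$ as the temperature-averaged squared deviation for molecule $i$ , and $\sigma_{i}^{2}$ as the temperature-averaged squared deviation of the experimental curve from its own mean. The integrals are approximated with the trapezoidal rule, and all function names are hypothetical.

```python
def trapezoid(ys, xs):
    """Trapezoidal approximation of the integral of ys over xs."""
    return sum(0.5 * (ys[k] + ys[k + 1]) * (xs[k + 1] - xs[k])
               for k in range(len(xs) - 1))


def molecule_metrics(temps, p_pred, p_exp):
    """Return (gamma, delta_sq, sigma_sq) for one molecule (assumed definitions)."""
    span = temps[-1] - temps[0]
    gamma = trapezoid([abs(a - b) for a, b in zip(p_pred, p_exp)], temps) / span
    p_mean = trapezoid(p_exp, temps) / span
    delta_sq = trapezoid([(a - b) ** 2 for a, b in zip(p_pred, p_exp)], temps) / span
    sigma_sq = trapezoid([(b - p_mean) ** 2 for b in p_exp], temps) / span
    return gamma, delta_sq, sigma_sq


def substitute_r2(per_molecule):
    """1 - sum(delta_i^2) / sum(sigma_i^2) over all molecules."""
    total_delta = sum(m[1] for m in per_molecule)
    total_sigma = sum(m[2] for m in per_molecule)
    return 1.0 - total_delta / total_sigma
```

Under these assumed definitions, a perfect prediction gives a substitute metric of 1, while predicting the mean vapor pressure at every temperature gives 0, which matches the interpretation given above.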

3.3 Flash point

Flash point is the lowest temperature at which a substance ignites in the presence of fire. Predicting it enables us to identify safe temperatures for lubricant storage and handling.

We study the flash point of linear alkanes with fewer than 31 carbon atoms. We collected experimental data from two online sources [18], [9]. After training a neural network with two hidden nodes, we obtain a cross-validation $\textit{R}^{2}=0.910$ . However, we can use neural networks to improve the prediction accuracy. We identify the data entries that lie more than 2 standard errors away from the expected value. In Figure 12, we see that alkanes that have between twenty and twenty-seven carbon atoms are multiple standard errors away from mean predictions and appear anomalous. After tracking down the original sources of this data [9], we found that the entries for alkanes from eicosane up to hexacosane are indeed incorrect. We further validate this claim by investigating the correlation between the flash point and the boiling point. It is empirically true that flash point is linearly correlated with the boiling point for hydrocarbon compounds [2], and linear alkane data entries fit this trend.
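
The screening step described above can be sketched as follows. This is a minimal illustration, assuming the network returns a mean prediction and a standard error for each entry; the function name and the numbers are hypothetical and are not the flash point data used in this work.

```python
def flag_outliers(observed, predicted, std_errors, threshold=2.0):
    """Return indices of entries lying more than `threshold` standard errors
    away from the model's mean prediction."""
    return [i for i, (y, mu, s) in enumerate(zip(observed, predicted, std_errors))
            if abs(y - mu) > threshold * s]


# Hypothetical values in degrees Celsius: only the third entry deviates
# from its prediction by more than two standard errors.
suspect = flag_outliers(observed=[10.0, 20.0, 35.0],
                        predicted=[10.5, 19.0, 30.0],
                        std_errors=[1.0, 1.0, 2.0])
```

Flagged entries are candidates for checking against the original sources, not automatic deletions.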

After removing the erroneous data entries, we predict the flash point again and obtain a cross-validation $\textit{R}^{2}=0.999$ , which would allow the model predictions to replace experimental measurements. We also compare our predictions to those made by a group contribution method presented in [31]. The neural network model reproduces the experimental flash point with an AAD of 1.65∘C, compared to an AAD of 8.08∘C for the group contribution method (Table 3). Our model gives a more accurate prediction for 16 out of the 21 alkanes used to build the model. In addition, our model shows greater consistency than the group contribution model. In particular, the model in [31] is far less accurate for several smaller molecules such as ethane and propane, as well as for octacosane and triacontane, whose flash point is mispredicted by over 30∘C, while the accuracy of our model is roughly consistent across all the data entries (Figure 13).

3.4 Melting point

Accurate predictions of the melting point reveal whether an alkane will solidify at the lower temperatures of the lubricant’s operating range. We study the melting point of branched alkanes with fewer than 13 carbon atoms and train a neural network with 5 hidden nodes and just the molecular basis as input nodes. Our dataset comprises 51 molecules whose melting points were experimentally measured [1]. After training a neural network model and cross-validating it, we obtain $\textit{R}^{2}=0.650$ . This poor reproduction of experimental data motivates us to search for additional physical correlations to improve the fidelity of the neural network predictions.

We have identified two additional effects that influence the melting point. Firstly, if the number of carbon atoms in the longest carbon chain is even, an alkane has a higher melting point than if the number of carbon atoms is odd. Secondly, an alkane with a higher number of molecular symmetries has a higher melting point. This effect is readily observed in the isomers of pentane [22]. Pentane has 4 molecular symmetries and a melting point of -129.9∘C, 2-methylbutane has 2 molecular symmetries and a melting point of -159∘C, while 2,2-dimethylpropane has 24 molecular symmetries and a melting point of -16.6∘C. Therefore, we add two more elements to the input layer of the neural network that we use to predict the melting point: one to capture the odd/even effect, and the other being the total number of symmetries.
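
The extended input layer can be sketched as follows. This is a minimal illustration, assuming the odd/even effect is encoded as a 0/1 flag on the length of the longest carbon chain; the actual encoding is not specified here. The function name is hypothetical, and the five-element molecular basis is replaced by a placeholder. The symmetry counts are the ones quoted above for the isomers of pentane.

```python
def melting_point_inputs(molecular_basis, longest_chain_length, n_symmetries):
    """Append the two extra descriptors to the molecular basis.

    The parity flag is 1 for an even-length longest chain and 0 for an
    odd-length one (an assumed encoding).
    """
    parity = 1 if longest_chain_length % 2 == 0 else 0
    return list(molecular_basis) + [parity, n_symmetries]


placeholder_basis = [0.0] * 5  # stands in for the five chemical descriptors

# Longest chain length and number of symmetries for the pentane isomers.
pentane = melting_point_inputs(placeholder_basis, 5, 4)
methylbutane = melting_point_inputs(placeholder_basis, 4, 2)
dimethylpropane = melting_point_inputs(placeholder_basis, 3, 24)
```

Each molecule is then described by seven inputs instead of five, with the last two carrying the parity and symmetry information.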

With these two additional chemical descriptors, we train and cross-validate a new neural network, obtaining $\textit{R}^{2}=0.998$ (Figure 14). We also compare our results to those obtained by several regression models that use molecular structure and topological indices as inputs [33] for 4 molecules whose melting point is common to both datasets (Table 4). The neural network model shows greater accuracy, as it reproduces experimental values with an AAD of 1.45∘C, compared to an AAD of 3.45∘C for models 4.1 and 4.2 in [33].

The significant improvement in accuracy of our model upon the introduction of the number of symmetries serves as a further indicator of the importance of the molecular symmetry on the melting point of alkanes. Looking forward, to further improve the accuracy of the predictions, one would also include the details of alkanes’ crystalline structure.

3.5 Kinematic viscosity

Dynamic viscosity ( $\mu$ ) is a measure of a fluid’s resistance to an external force. The ratio of dynamic viscosity to density ( $\rho$ ) gives the kinematic viscosity, a measure of a fluid’s flow properties. Predicting an alkane’s kinematic viscosity at 40∘C and 100∘C enables us to calculate its viscosity index (VI) [28], frequently used in industry as a measure of the temperature gradient of kinematic viscosity.

We study the kinematic viscosity of the linear alkanes from heptane to heneicosane. We first train a neural network to predict dynamic viscosity and density using an experimental database assembled by merging experimental data for shear viscosity and density as a function of temperature and pressure obtained from various research papers ([3], [4], [7], [8], [12], [16], [20], [23]) into a single dataset. Our dataset for density has 537 data entries while our dataset for viscosity has 638 data entries. We then take the ratio of the predictions to determine the kinematic viscosity and its uncertainty.
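
The ratio step and its uncertainty can be sketched as follows. This is a minimal illustration, assuming that the uncertainties on the predicted dynamic viscosity and density are independent and small enough for first-order error propagation; the propagation scheme is an assumption and is not stated in this text. The function name and the numerical inputs are hypothetical. With $\mu$ in mPa s and $\rho$ in g/cm$^{3}$ , the ratio is in cSt.

```python
import math


def kinematic_viscosity(mu, sigma_mu, rho, sigma_rho):
    """Return (nu, sigma_nu) for nu = mu / rho.

    Relative uncertainties are added in quadrature, which assumes
    independent errors on mu and rho.
    """
    nu = mu / rho
    sigma_nu = nu * math.sqrt((sigma_mu / mu) ** 2 + (sigma_rho / rho) ** 2)
    return nu, sigma_nu


# Hypothetical predictions: 1.0 +/- 0.03 mPa s and 0.8 +/- 0.024 g/cm^3.
nu, sigma_nu = kinematic_viscosity(1.0, 0.03, 0.8, 0.024)
```

If the two predictions shared correlated errors, a covariance term would need to be added to the quadrature sum.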

Data for dynamic viscosity and density as a function of temperature and pressure is fragmented, so we need more input parameters for the neural network models. To predict density and dynamic viscosity at 40∘C and 100∘C, we include the additional sparse data of density and dynamic viscosity at 25∘C and atmospheric pressure. For the branched alkanes with eight, nine and ten carbon atoms, the neural network model for density as a function of temperature increased in accuracy from $\textit{R}^{2}=0.412$ to $\textit{R}^{2}=0.840$ due to the inclusion of this additional information (Figure 15). Furthermore, focusing just on linear alkanes we obtain a cross-validation $\textit{R}^{2}$ of 0.998 for dynamic viscosity at 20∘C and of 0.987 for density.

Next, we compare the values for kinematic viscosity at 20∘C and atmospheric pressure obtained by our ANN to values obtained by a model based on free volume theory [34] (Table 5). The neural network model is more accurate than the free volume theory model, reproducing experimental data with $\textit{R}^{2}=0.998$ and an average absolute deviation of 0.05 cSt, compared to $\textit{R}^{2}=0.899$ and an average absolute deviation of 0.31 cSt for the free volume theory model [34]. Furthermore, the neural network model shows greater consistency across the molecules analysed than the free volume theory model. The standard deviation of the absolute deviation is 0.03 cSt for the neural network model compared to 0.57 cSt for the free volume theory model, whose absolute deviations in the kinematic viscosity of pentadecane (1.73 cSt) and tridecane (0.64 cSt) are particularly large (Figure 17).

Finally, we run our neural network model on density and dynamic viscosity at 40∘C and 100∘C at atmospheric pressure to calculate the kinematic viscosity at these two temperatures (Figure 16) and then determine each alkane’s viscosity index. The neural network model can provide insights into which linear alkanes could feature in a commercialized lubricant. Eicosane is the only linear alkane modelled here whose kinematic viscosity at 100∘C is above 2 cSt, so it is the only linear alkane for which we can define a viscosity index. However, eicosane is a solid below 36∘C, so it could only be present in a base oil lubricant in relatively small amounts, as lubricants are usually expected to operate between -15∘C and 100∘C. Therefore, it is likely that linear alkanes are present in base oil lubricants only in relatively small amounts.

4 Conclusions

We have used artificial neural networks that exploit inter-property correlations to predict the physical properties of alkanes. The algorithm describes the molecular structure of linear, single, and double branched alkanes, and enables us to predict the boiling point, the heat capacity and the vapor pressure as a function of temperature. We also predicted the flash point of linear alkanes up to triacontane and identified erroneous experimental entries in the literature. The number of molecular symmetries correlates with the melting point. Finally, we have exploited the temperature and pressure dependence of dynamic viscosity and density alongside inter-property correlations across the temperature range to predict the kinematic viscosity at atmospheric pressure as a function of temperature. Values of physical properties reproduced with these neural networks are more accurate and consistent than the values reproduced by other methods. We present a summary of our results for the boiling point, the molar heat capacity, the Antoine coefficients, the flash point, the melting point and the kinematic viscosity in Table 6.

Our study serves as a solid platform from which to further investigate physical properties of alkanes. This generic neural network architecture could merge sparse experimental data with molecular dynamics simulations to predict physical properties of alkanes, particularly the intractable properties like shear viscosity and density, enabling us to identify the alkanes that could be components for lubricant base oils with superior physical properties.

Pavao Santak acknowledges financial support of BP-ICAM. Gareth Conduit acknowledges financial support from the Royal Society. Both authors thank Leslie Bolton, Corneliu Buda and Nikolaos Diamantonis for useful discussions. There is an Open Access at https://www.openaccess.cam.ac.uk.

Bibliography

[1] American Petroleum Institute. Research Project 44 and Texas Engineering Experiment Station. Thermodynamics Research Center. TRC Thermodynamic Tables: Hydrocarbons. Thermodynamics Research Center, Texas Engineering Experiment Station, Texas A & M University System, 1986.
[2] Sara S. Alqaheem and M. R. Riazi. Flash points of hydrocarbons and petroleum products: Prediction and evaluation of methods. Energy & Fuels, 31(4):3578–3584, 2017.
[3] M. J. Assael and M. Papadaki. Measurements of the viscosity of n-heptane, n-nonane, and n-undecane at pressures up to 70 MPa. International Journal of Thermophysics, 12(5):801–810, Sep 1991.
[4] Hseen O. Baled, Dazun Xing, Harrison Katz, Deepak Tapriyal, Isaac K. Gamwo, Yee Soong, Babatunde A. Bamgbade, Yue Wu, Kun Liu, Mark A. McHugh, and Robert M. Enick. Viscosity of n-hexadecane, n-octadecane and n-eicosane at pressures up to 243 MPa and temperatures up to 534 K. The Journal of Chemical Thermodynamics, 72:108–116, 2014.
[5] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, Berlin, Heidelberg, 2006.
[6] P. C. Verpoort, P. MacDonald, and G. Conduit. Materials data validation and imputation with an artificial neural network. 147, 02 2018.
[7] D. R. Caudwell, J. P. M. Trusler, V. Vesovic, and W. A. Wakeham. The viscosity and density of n-dodecane and n-octadecane at pressures up to 200 MPa and temperatures up to 473 K. International Journal of Thermophysics, 25(5):1339–1352, Sep 2004.
[8] Derek R. Caudwell, J. P. Martin Trusler, Velisa Vesovic, and William A. Wakeham. Viscosity and density of five hydrocarbon liquids at pressures up to 200 MPa and temperatures up to 473 K. Journal of Chemical & Engineering Data, 54(2):359–366, 2009.