From Local Atomic Environments to Molecular Information Entropy
Alexander Croy

TL;DR
This paper explores how molecular complexity can be measured using information entropy derived from atomic environments, showing its usefulness in comparing molecules.
Contribution
The paper introduces a novel method to compute molecular entropy based on similarity matrices derived from atomic environments.
Findings
Two similarity definitions (SMILES and SOAP kernel) produce comparable entropy measures when sensitivity is adjusted.
Entropy gain from mixing molecules can serve as a similarity measure, aligning with other kernel-based methods.
Abstract
The similarity of local atomic environments is an important concept in many machine learning techniques, which find applications in computational chemistry and material science. Here, we present and discuss a connection between the information entropy and the similarity matrix of a molecule. The resulting entropy can be used as a measure of the complexity of a molecule. Exemplarily, we introduce and evaluate two specific choices for defining the similarity: one is based on a SMILES representation of local substructures, and the other is based on the SOAP kernel. By tuning the sensitivity of the latter, we can achieve good agreement between the respective entropies. Finally, we consider the entropy of two molecules in a mixture. The gain of entropy due to the mixing can be used as a similarity measure of the molecules. We compare this measure to the average and best-match kernel. The…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6- —Deutsche Forschungsgemeinschaft10.13039/501100001659
- —Volkswagen Foundation10.13039/501100001663
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Drug Discovery Methods · Machine Learning in Materials Science · Crystallography and molecular interactions
Introduction
1
The concept of similarity is directly linked to the complexity of an object: decomposing the object into different, distinguishable units, the number of those units provides a measure of its complexity.^1^ In this sense, the complexity, or information content, of a molecule can be defined, and over the past decades, different measures have been proposed, e.g.^2−6^ Unfortunately, the different complexity measures are often hardly comparable. In this article, we establish a connection between the information entropy (in the sense of Shannon^7,8^) and the similarity matrix constructed from the local atomic environments of a molecule. The resulting expression is analogous to the von-Neumann entropy^9^ and can be used as a general framework for quantifying molecular complexity.
Similarity also plays an important role in many machine learning techniques, like kernel-ridge regression (KRR) or Gaussian process regression (GPR).^10^ In combination with descriptors of local atomic environments,^11,12^ those methods have been very successful in different areas of computational chemistry and material science.^13^ The atomic environments are usually defined in terms of all atoms within a certain distance of a reference atom. Suitable descriptors are then found, for example, from an expansion of the density of atoms in the environment into radial basis functions and spherical harmonics, i.e. using a smooth overlap of atomic positions (SOAP).^14^ The kernel function entering the KRR or GPR is calculated from such descriptors and can be interpreted as a measure of the similarity of two local atomic environments. Comparing all pairs of environments in this way leads to the similarity matrix of the specific molecule. On the other hand, for learning and predicting the global properties of molecules, one can introduce the similarity between different molecules.^15^ The latter can also be constructed from the similarity matrix of the local atomic environments.^16^
In order to calculate the molecular information entropy, we present two specific choices for defining a similarity function of local atomic environments: one is based on a graph representation of the molecule, choosing substructures around a reference atom and comparing the resulting SMILES strings.^17,18^ The second approach facilitates the aforementioned SOAP similarity kernel and thus uses the positions and atomic numbers of the atoms. Both approaches are suitable for automatized computational studies, and we present results for a selection of molecules from the QM9 data set.^19,20^
Finally, we investigate the information entropy for pairs of molecules, which leads to the mixing entropy. The latter is the maximal gain of information entropy upon mixing two molecules.^6^ Based on this observation, we propose and construct a new similarity measure of molecules and compare it to previously studied kernels.^16^ Our results demonstrate the usefulness and broad applicability of the similarity-based entropy approach.
Methods
2
Information Entropy
2.1
Typically, information entropy is considered in contexts involving some kind of “experiment” or “process”, where each time one of the events A1, A2, ..., An occurs at random.^21^ Knowing the probabilities p1, p2, ..., pn for those events, one can characterize the amount of uncertainty about the outcomes by introducing the Shannon entropy(7,8) according to
The logarithm can be taken with respect to any base, but it is usually assumed to be base two. Moreover, we take pi log pi = 0 if pi = 0. One readily sees that the entropy vanishes if the probability for one event is one and the others are zero accordingly. This describes an experiment with no uncertainty because the outcome would always be the same. On the other hand, one has maximum uncertainty and thus maximal entropy if all events have the same probability pi = 1/n. In this case, H = log n.
In the context of molecules and graphs,^1,22^ a different point of view might be more suitable. If we decompose the molecule (or a graph) into n different parts (e.g., its atoms or vertices) and assign each of them with one of the equivalence classes (e.g., atom types) A1, A2, ..., and Am (m ≤ n), then we can construct a finite scheme by associating the probability pi = ni/n to the respective class. The number of parts we found for each class is denoted by ni, i.e., ∑ i = 1^m^ni = n. The entropy given by eq 1 can be viewed as a measure of the complexity of the object. If all parts belong to the same class (n1 = n), then the complexity is zero. Conversely, if all parts belong to a different class (ni = 1 and n = m), then the complexity is maximal for that particular system.
Information
Entropy from Similarity
2.2
To obtain a connection between information entropy of a molecule and the similarities of its atoms, we start from a similarity function as follows,
If two atoms k and l in the molecule are chemically, or otherwise, equivalent, the function yields 1, and otherwise it gives 0. Using the similarity function for all pairs of atoms yields the similarity matrix of the molecule. This matrix is symmetric and positive semidefinite, and its trace equals the number of atoms in the molecule.
By permuting rows and columns of a similarity matrix of size n × n arising from eq 2, it can always be written as a direct sum of matrices of ones, i.e., S ≃ 1n1 ⊕ 1n2 ⊕ ⋯ ⊕ 1nm__. The similarity matrix becomes block-diagonal with m blocks of size ni × ni, respectively. Consequently, it has m nonzero eigenvalues, namely {n1, n2,..., nm} and n – m eigenvalues which are zero1. This suggests that we can obtain a finite scheme directly from the similarity matrix since the nonzero eigenvalues of S divided by n yield the required probabilities pi = ni/n. Moreover, we can directly calculate the associated entropy
Here, Tr denotes the trace of the matrix and log is the matrix logarithm. The last expression is analogous to the von-Neumann entropy,^9^ which is not only used in quantum mechanics, but also in the context of complex networks.^23^ This analogy suggests that we can generalize the similarity function eq 2 to any positive semidefinite and symmetric function with a value range 0 ≤ S ≤
- The expression for the entropy remains unchanged.
We can also introduce the linear entropy by expanding the logarithm to the lowest order:
It is given in terms of the average of the squared elements of the similarity matrix. If the latter contains only zeros and ones, the latter average is equal to the average of the elements themselves. While it is approximate, this expression can be easier to calculate, especially for large similarity matrices.
Of course, the main question is how to find a suitable similarity function. From a chemical point of view, one might have different options for specifying chemical equivalence^3,5^ or one might use graph-theoretic concepts.^4,22^ In the following, two approaches are presented and discussed, which are also motivated by applicability in computational settings.
Substructure-SMILES
Similarity
2.3
As a first approach to find the similarity of atomic environments, we used a graph representation of the molecule under consideration. For each atom of the molecule, we select a subgraph which involves all atoms which are connected to the reference atom by at most N bonds. For example, N = 1 would entail the atom itself and the bonded neighbors. Each subgraph is then converted to a (canonical) SMILES string^17,18^ with the reference atom as starting point. In practice, we use rdkit^24^ to generate the environments and the SMILES strings. The similarity function is then simply defined via the comparison of the respective SMILES strings of the subgraphs, i.e.,
Instead of exact string comparison, one can also utilize any other string or SMILES similarity method.^25^
Figure 1 illustrates the atomic environments of one of the carbon atoms in ethanol for different values of N. For N = 0, the SMILES strings only contain the element symbol of the respective atom, and therefore all hydrogen atoms, and all carbon atoms are found to be identical (lower right block in the similarity matrix). Already for N = 1, this is no longer the case as the environments take neighboring atoms into account, and the hydrogen atom bound to the oxygen is considered as distinct from the other hydrogens. The similarity matrices show increasing differentiation between the different atoms until convergence is reached for N ≥ 2 in this case.
Top row: Atomic environments for one of the carbon atoms in ethanol with N = 0, 1, and 2 (from left to right). The substructure-SMILES strings are given below the structures. Bottom row: Similarity matrices obtained from eq 5 for N = 0, 1, 2.
SOAP
Similarity
2.4
Another strategy for defining local atomic environments is the SOAP approach.^14,16^ It is based on the representation of the local density of atoms in the vicinity of a central atom. The corresponding environment is defined by a cutoff radius. Expanding the density in terms of radial basis functions gn(r) and spherical harmonics Ylm(r̂) leads to
Here, r denotes the length and r̂ the direction of the position vector , and Z indicates the atomic species. The expansion coefficients yield the rotationally invariant partial power spectrum,
The elements can be collected into an unit length vector which is then used to define the similarity function for two atomic environments,
It can be shown that SSOAP is a positive definite function, which is obviously symmetric with respect to k and l. The integer exponent ζ ≥ 1 can be used to increase the sensitivity of the similarity function.^14^ In the definition above, we have additionally enforced the dissimilarity of environments centered around different species.
Results and Discussion
3
Molecular
Information Entropies
3.1
First, we consider the molecular information entropies calculated from the substructure-SMILES similarity approach. To this end, we collected 13 small molecules for which the information entropies are known based on symmetry and chemical intuition.^26,27^Figure 2 shows a comparison of the entropies obtained from eq 3 for different sizes N of the atomic environment. Consistent with the observation in Section 2.3, the entropy increases with an increasing N for some molecules until it converges to the expected value. In those cases, larger environments are necessary to achieve sufficient differentiability between all of the environments (as shown in Figure 1 for CCO).
Molecular information entropies HSMILES calculated from the substructure-SMILES similarity approach for the molecules denoted on the upper axis. The values are obtained for different sizes of the atomic environments (N = 0, 1, 2, 3) and compared to values from literature (topmost row).26,27
In the next step, we use the ground state geometries of the 13 molecules mentioned above from the QM9 data set^19,20^ and calculate the SOAP descriptors for each atom using dscribe^28,29^ (with rcut = 6 Å, nmax = 10, and lmax = 6). The similarity matrix and entropy are then calculated via eqs 8 and3, respectively, for a given sensitivity exponent ζ. Figure 3 shows a comparison of the corresponding entropies with the entropies found from the substructure-SMILES similarity approach (with N = 3). As the sensitivity increases for larger ζ, the entropies also increase since the similarity function, further distinguishing the atomic environments. Apparently, the entropies do not converge in all cases to the SMILES entropies. The reason is that the functional form of the similarity function in eq 5 suppresses all entries which are ≤1 with increasing N, which leads in the extreme case to all environments being dissimilar and yields the maximum entropy.
Comparison of entropies obtained from substructure-SMILES (HSMILES) and the SOAP approach (HSOAP), respectively. The red circles indicate values from literature.26,27 The different plots show SOAP entropies with an increasing sensitivity exponent ζ. Perfect matching is indicated by the straight lines.
To further characterize the similarities arising from the two approaches, we consider the substructure-SMILES similarity as references and compute the Kullback–Leibler (KL) divergence^30^ for the SOAP similarity of the same molecule,
The KL divergence is a positive function and becomes zero if the two similarity matrices are identical. As an illustration, we take the first 184 molecules from the QM9 data set [10 are excluded due to SMILES mismatch between rdkit and QM9] and compute the average KL divergence for different sensitivity exponents. From Figure 4, one can see that the KL divergence initially decreases with increasing sensitivity, in accordance with our previous observations. Then, at ζ ≈ 64 it shows a minimum, implying that the respective SOAP similarities on average match best to the SMILES similarities. For larger exponents, the KL divergence increases slowly as more and more entries in SSOAP are vanishing. Figure 4 also shows the comparison of the entropies obtained from the SOAP similarities for ζ = 64 and the SMILES-based similarities. We observe good but not perfect agreement, which can again be attributed to the sweeping effect of the sensitivity exponent. It should be noted that the binary nature of the SMILES similarity is an extreme case that requires an atypical high sensitivity.
Left: Average Kullback–Leibler divergence ⟨D⟩ of the SMILES and SOAP similarity matrices of 174 molecules. Right: Comparison of entropies obtained from the substructure-SMILES (HSMILES) and the SOAP approach (HSOAP) near the minimum of ⟨D⟩ for 174 molecules. Perfect matching is indicated by a straight line.
Mixing
Entropy and Molecular Similarity
3.2
One important question concerns the complexity or information entropy of mixtures of molecules as this directly applies to chemical reactions.^26,27,31^ We take two molecules, and , and suppose that we have a similarity function S, which can be applied to any pair of atomic environments. The resulting similarity matrix has a 2 × 2 block structure with the diagonal blocks (SI and SII) referring to the similarity between parts of each individual molecule, and the off-diagonal blocks (SI,II = SII, I^T^) refer to similarity between environments belonging to different molecules.
If the two molecules are identical, all four blocks are equal to SI and the nonvanishing eigenvalues of SI,II become those of 2 SI. Since the total number of parts is 2nI, the entropy of the two molecules is the same as that for the individual molecule. In the other case, if the two molecules do not share any equivalent atomic environments, then the off-diagonal blocks are filled with zeros, and the similarity matrix can be written as a direct sum SI+II = SI⊕SII. In this case, the entropy becomes a weighted average of the individual entropies plus a term which can be called entropy of mixing,^26,27^
with
For nI = nII, the mixing entropy is equal to log 2 (or one bit per atom). For any pair of molecules, we can define the gain of entropy due to mixing via
In general, this quantity will have values between 0 and Hmix ≤ 1.
To illustrate the concept of mixing entropy, we take all pairs of the first 184 molecules from the QM9 data set and calculate the combined similarity matrices SI,II and the corresponding entropies H(SI+II). In Figure 5, the latter are compared to the mixing entropies Hmix(nI, nII) for the molecules using the substructure-SMILES approach to calculate the similarities (with N = 3). For the data points on the diagonal, the molecules do not share equivalent atomic environments and are ΔH = Hmix. However, there are a number of molecular pairs for which the entropy is smaller than the mixing entropy.
Comparison of the gain of entropy due to mixing of two molecules ΔHSMILES and the respective mixing entropy Hmix. The similarities are calculated using the substructure-SMILES approach (with N = 3). Perfect matching is indicated by the straight line.
The results for the mixing entropies suggest that one can use the ratio of the gain of entropy due to mixing and the mixing entropy Hmix(nI, nII) as a measure for similarity of two molecules. This ratio is zero for two identical molecules and one for molecules not sharing atomic environments. Other approaches to compare two molecules based on similarities of local atomic environments include the average structural kernel and the best-match structural kernel.^16^ The former is calculated by averaging the elements of the off-diagonal block in SI,II, i.e.,
The best-match kernel can be formulated in terms of a rectangular assignment problem,^16,32^
The matrix X contains only zeros and ones and additionally its elements are subject to ∑jXij = 1 ∀ i and ∑iXij ≤ 1 ∀ j. If NI > NII, the arguments and have to be interchanged. Here, we have generalized the two kernels by allowing arbitrary powers p of the entries in S to be used while in ref (16) one is summing the elements of the similarity matrix (p = 1) in both kernels. The reason for introducing p is given by the expression of the linear entropy in eq 4. There, only squared elements of the similarity matrix appear which suggests that taking the sum over the squared elements in K̅^(p)^ or K̂^(p)^, i.e. p = 2, is more natural when comparing to the entropy-based similarity as also shown below.
Taking again the pairs of the 184 molecules from before, we compute the different molecular similarities for the substructure-SMILES and the SOAP approaches and compare them in Figure 6. One sees that for the SMILES approach, the results for p = 1 and p = 2 are identical, which is expected since Sij = Sij^2^ in that case. On the other hand, for the SOAP approach, one sees a clear dependence on p. The best-match kernel with p = 2 yields similarities that agree on average with the entropy-based similarities. In contrast, the p = 1 kernel gives rise to a systematic nonlinear deviation. With the average kernels one obtains similarities that are rather different from the entropy-based measure. In either case, it appears that the kernels with p = 2 have a correspondence with the entropy-based similarities which is close to linear. The results show that the considered molecular similarity measures are related despite their different starting points.
Comparison of two molecular similarity kernels, K̅(p) and K̂(p), with a measure obtained via the gain of entropy due to mixing, 1 – ΔH/Hmix. Top row: Results for similarities were calculated from the substructure-SMILES approach with environment size N = 3. Bottom row: Results for similarities were calculated from the SOAP approach with sensitivity exponent ζ = 16. Perfect matching is indicated by the straight lines.
Conclusions
4
In summary, we have presented and discussed a connection between the information entropy of a molecule and the similarity matrix of its local atomic environments. The entropy can be obtained from an expression, given by eq 3, which is akin to the von-Neumann entropy. This relation provides a convenient framework for calculating the information entropy and also for analyzing its properties.
Two approaches for obtaining the similarity of the atomic environments were given. The substructure-SMILES method is based on a graph representation of the molecule. The similarity is defined in terms of a comparison of the SMILES strings for substructures. For a set of molecules, the entropies were shown to correspond to the known and expected values if the sizes of the substructures were sufficiently large. The second approach was based on SOAP descriptors, which were obtained from positions and atomic numbers of atoms in a molecule. The sensitivity of the resulting similarities can be tuned by an integer exponent ζ, and we found a good agreement between SOAP- and SMILES-based entropies for larger values of ζ. It should be noted that any value <1 in the similarity matrix is suppressed with increasing ζ, so that it is not expected that the values converge to the SMILES-based calculations. But the results show that both approaches can be used to estimate the information entropy of molecules. The choice of similarity function and the tuning of its hyperparameters (like ζ) should be adapted to the given application.
Finally, we investigated the entropy of a pair of molecules. For identical molecules, this entropy corresponds to one of the individual molecule. If the molecules do not share any similar atomic environments, the ensemble entropy becomes equal to a weighted sum of the individual entropies plus the mixing entropy, which only depends on the number of atoms in each molecule. In general, the entropy of the pair takes values between these two extremes. This was the motivation for defining the similarity of two molecules in terms of the entropy gain due to mixing. We compared this measure to two other similarity kernels (the average structural kernel and the best-match structural kernel) for 184 molecules and found, on average, very good agreement with a modified best-match kernel. Thus, the entropy-based molecular similarity provides an alternative measure for comparing molecules with a strong anchoring in the framework of molecular information entropies.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Rashevsky N. Life, information theory, and topology. Bulletin of Mathematical Biophysics 1955, 17, 229–235. 10.1007/BF 02477860. · doi ↗
- 2Bonchev D.; TrinajstićN. Information theory, distance matrix, and molecular branching. J. Chem. Phys. 1977, 67, 4517–4533. 10.1063/1.434593. · doi ↗
- 3Bertz S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 1981, 103, 3599–3601. 10.1021/ja 00402 a 071. · doi ↗
- 4Bonchev D.; TrinajstićN. Chemical information theory: Structural aspects. Int. J. Quantum Chem. 1982, 22, 463–480. 10.1002/qua.560220845. · doi ↗
- 5Böttcher T. An Additive Definition of Molecular Complexity. J. Chem. Inf. Model. 2016, 56, 462–470. 10.1021/acs.jcim.5b 00723.26857537 · doi ↗ · pubmed ↗
- 6Sabirov D. S.; Shepelevich I. S. Information Entropy in Chemistry: An Overview. Entropy 2021, 23, 124010.3390/e 23101240.34681964 PMC 8534366 · doi ↗ · pubmed ↗
- 7Shannon C. E. A mathematical theory of communication. Bell System Technical Journal 1948, 27, 379–423. 10.1002/j.1538-7305.1948.tb 01338.x. · doi ↗
- 8Shannon C. E. A mathematical theory of communication. Bell System Technical Journal 1948, 27, 623–656. 10.1002/j.1538-7305.1948.tb 00917.x. · doi ↗
