Rigidity strengthening is a vital mechanism for protein-ligand binding

Duc Duy Nguyen; Tian Xiao; Menglun Wang; Guo-Wei Wei

arXiv:1704.05883·q-bio.BM·April 21, 2017

Rigidity strengthening is a vital mechanism for protein-ligand binding

Duc Duy Nguyen, Tian Xiao, Menglun Wang, Guo-Wei Wei

PDF

Open Access

TL;DR

This paper demonstrates that protein rigidity strengthening and reduced flexibility are key mechanisms in protein-ligand binding, with implications for drug and protein design, and introduces a rigidity-based approach that outperforms existing scoring functions.

Contribution

The study reveals a long-range contribution of residue layers to binding and highlights the importance of short-range interactions, advancing understanding of binding mechanisms.

Findings

01

Rigidity strengthening is crucial in protein-ligand binding.

02

Residue layers contribute significantly to binding affinity.

03

The proposed rigidity-based method outperforms existing scoring functions.

Abstract

Protein-ligand binding is essential to almost all life processes. The understanding of protein-ligand interactions is fundamentally important to rational drug design and protein design. Based on large scale data sets, we show that protein rigidity strengthening or flexibility reduction is a pivoting mechanism in protein-ligand binding. Our approach based solely on rigidity is able to unveil a surprisingly long range contribution of four residue layers to protein-ligand binding, which has a ramification for drug and protein design. Additionally, the present work reveals that among various pairwise interactions, the short range ones within the distance of the van der Waals diameter are most important. It is found that the present approach outperforms all the other state-of-the-art scoring functions for protein-ligand binding affinity predictions of two benchmark data sets

Equations11

μ_{i} = j = 1 \sum N w_{j} Φ_{τ} (∥ r_{i} - r_{j} ∥) and f_{i} = \frac{1}{μ _{i}},

μ_{i} = j = 1 \sum N w_{j} Φ_{τ} (∥ r_{i} - r_{j} ∥) and f_{i} = \frac{1}{μ _{i}},

Φ (∥ r_{i} - r_{i} ∥)

Φ (∥ r_{i} - r_{i} ∥)

Φ (∥ r_{i} - r_{j} ∥)

Φ_{κ, τ}^{E} (∥ r_{i} - r_{j} ∥) = e^{- (∥ r_{i} - r_{j} ∥/ η_{ij})^{κ}}, κ > 0;

Φ_{κ, τ}^{E} (∥ r_{i} - r_{j} ∥) = e^{- (∥ r_{i} - r_{j} ∥/ η_{ij})^{κ}}, κ > 0;

Φ_{ν, τ}^{L} (∥ r_{i} - r_{j} ∥) = \frac{1}{1 + ( ∥ r _{i} - r _{j} ∥/ η _{ij} ) ^{ν}}, ν > 0,

Φ_{ν, τ}^{L} (∥ r_{i} - r_{j} ∥) = \frac{1}{1 + ( ∥ r _{i} - r _{j} ∥/ η _{ij} ) ^{ν}}, ν > 0,

RI_{β, τ, c}^{α} (X - Y) = k \in X \in Pro \sum l \in Y \in Lig \sum Φ_{β, τ}^{α} (∥ r_{k} - r_{l} ∥), \forall∥ r_{k} - r_{l} ∥ \leq c,

RI_{β, τ, c}^{α} (X - Y) = k \in X \in Pro \sum l \in Y \in Lig \sum Φ_{β, τ}^{α} (∥ r_{k} - r_{l} ∥), \forall∥ r_{k} - r_{l} ∥ \leq c,

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsComputational Drug Discovery Methods · Protein Structure and Dynamics · Enzyme Structure and Function

Full text

Rigidity strengthening is a vital mechanism for protein-ligand binding

Duc Nguyen1, Tian Xiao1, Menglun Wang1 and Guo-Wei Wei1,2,3 iii Address correspondences to Guo-Wei Wei. E-mail:[email protected]

1 Department of Mathematics

Michigan State University, MI 48824, USA

2 Department of Biochemistry and Molecular Biology

Michigan State University, MI 48824, USA

3 Department of Electrical and Computer Engineering

Michigan State University, MI 48824, USA

Abstract

Protein-ligand binding is essential to almost all life processes. The understanding of protein-ligand interactions is fundamentally important to rational drug design and protein design. Based on large scale data sets, we show that protein rigidity strengthening or flexibility reduction is a pivoting mechanism in protein-ligand binding. Our approach based solely on rigidity is able to unveil a surprisingly long range contribution of four residue layers to protein-ligand binding, which has a ramification for drug and protein design. Additionally, the present work reveals that among various pairwise interactions, the short range ones within the distance of the van der Waals diameter are most important. It is found that the present approach outperforms all the other state-of-the-art scoring functions for protein-ligand binding affinity predictions of two benchmark data sets.

Introduction

Protein-ligand binding is fundamental to many biological processes in living organisms. The binding process involves detailed molecular recognition, synergistic protein-ligand corporation, and possible protein conformational change. Agonist binding alternates receptor function and trigger a physiological response, such as transmitter-mediated signal transduction, hormone and growth factor regulated metabolic pathways, and stimulus-initiated gene expression, enzyme production, and cell secretion, to name only a few. The understanding of protein-ligand interactions is essential for drug design and protein design, and has been a central issue in molecular biophysics, structural biology and medicine. A common belief is that protein-ligand binding is driven by free energy reduction, which is described in intermolecular forces, such as ionic bonds, hydrogen bonds, hydrophobic effects, and van der Waals (vdW) interactions [1, 2]. However, this view has not directly translated into accurate binding affinity predictions of large scale binding data sets, despite decades of efforts. Other potential mechanisms, such as flexibility reduction or rigidity enhancement, have been neglected in the current modeling and computation.

Current understanding of flexibility with respect to protein-ligand binding is very limited. On the one hand, it is well-known that flexibility or plasticity of proteins as well as ligands facilitates the ligand docking during the binding process [3, 4]. On the other hand, protein-ligand binding reduces the system entropy, which favors the disassociation process. Since protein flexibility is intuitively associated with conformational entropy, binding induced flexibility reduction is widely regarded as unfavorable to the protein-ligand binding [5]. This work offers evidence against this prevalent view.

Thermodynamically, the protein-ligand binding process is described by the binding affinity, i.e., the change in the Gibbs free energy, which can be expressed in terms of enthalpy and entropy changes at a given temperature. The intricate interplay between enthalpy and entropy and over-simplified association between flexibility and entropy have made the role of flexibility in protein-ligand binding elusive. Fortunately, the availability of vast amount of affinity data bases [6] makes it possible to direct test out new hypotheses and reexamine existing theories and computational models. Given the importance of protein-ligand binding to a number of biological fields and disciplines, a wide variety of theoretical and computational approaches have been proposed for the binding affinity prediction. One might classifies these scoring functions into four categories, namely physics based, empirically based, knowledge based and machine learning based ones [7]. Physics based models consider vdW and electrostatics interactions between protein and ligand, in addition to hydrogen bonding and solvation effects [8, 9, 10]. Empirical or regression methods regard binding affinity as a super position of vdW interaction, hydrogen bonding, desolation, and metal chelation, etc [11, 12, 13]. Knowledge-based approaches make use of available protein-ligand binding databases to define interaction potentials and scoring functions [14, 15, 16]. Finally, machine learning strategies take advantages of large scale databases and the progress in statistical regression algorithms [17, 18, 19] to construct scoring functions that outperform other existing binding predictors [20]. However, the winning of machine learning strategies over the physics based models and the dependence of these strategies on numerous, sometimes, apparently unphysical descriptors [20] make the molecular mechanism of protein-ligand binding more elusive than ever before.

The objective of the present work is to elucidate the essential role of flexibility or rigidity in protein-ligand binding. We postulate that binding induced protein flexibility reduction, or rigidity strengthening, plays a unique and vital role in the protein-ligand binding. This hypothesis guides us to design a purely rigidity based machine learning strategy for the prediction of protein-ligand binding affinities. Protein rigidity modeling is carried out using flexibility-rigidity index (FRI) [21]. FRI has been introduced as a simple and efficient algorithm for protein flexibility and thermal fluctuation analysis [21]. It has been shown to be about 20% more accurate and orders of magnitude more efficient than other classic approaches, such as Gaussian network model (GNM) [22], and anisotropic network model (ANM) [23] in the B-factor prediction of hundreds of proteins [21, 24, 25] and protein-nucleic acid complexes [26]. For example, it predicts the B-factors of the entire HIV capsid with 313,236 residues in less 30 second, which would require GNM more than 120 years to compute were the memory not a problem [24]. Our postulation is confirmed by the evaluation of rigidity index (RI) over 195 protein complexes. Our rigidity index based binding affinity predictor, RI-Score, is able to outperform all other eminent scoring functions, which strongly supports our hypothesis that flexibility reduction or rigidity enhancement is a vital mechanism in protein-ligand binding.

Flexibility-rigidity index (FRI)

Consider a biomolecule having $N$ atoms with coordinates given as $\{{\bf r_{i}|r_{i}}\in\mathbb{R}^{3},i=1,2,\dots,N\}$ . We denote $\|\mathbf{r}_{i}-\mathbf{r}_{j}\|$ the Euclidean distance between $i$ th and $j$ th atom. We denote $r_{i}$ the van der Waals radius of $i$ th atom and set $\eta_{ij}=\tau(r_{i}+r_{j})$ as a scale to characterize the distance between the $i$ th and the $j$ th atoms. Where $\tau>0$ is an adjustable parameter. The atomic rigidity index and flexibility index are expressed as

[TABLE]

where $w_{j}$ are the particle-type dependent weights, and $\Phi$ is a real-valued monotonically decreasing correlation function satisfying

[TABLE]

Commonly used FRI correlation functions include generalized exponential functions

[TABLE]

and generalized Lorentz functions

[TABLE]

As shown in Fig. 1, both generalized exponential functions and generalized Lorentz functions approximate the ideal low-pass filter (ILF) as $\kappa\rightarrow\infty$ and $\nu\rightarrow\infty$ , respectively.

FRI measures the topological connectivity of the protein-ligand network at every node with appropriate distance-based weights and describes the binding complex with a high level of abstraction. Such an abstraction is ideally suited for the extraction of intrinsic protein-ligand interactions from complex and large-scale binding datasets. One advantage of FRI is that it allows the multiresolution analysis of protein-ligand binding interactions by varying parameter $\tau$ , which endows FRI the ability to explore what is the dominant protein-ligand interaction force. Another advantage of FRI is its multiscale analysis via multi-kernel based multiscale FRI (mFRI) [25], which enables FRI to capture different length-scales in various protein-ligand interactions. Finally, by using an ILF representation, the FRI is able to quantitatively detect the relevant length of interactions.

Binding induced rigidity strengthening

We first illustrate how the protein rigidity is quantitatively strengthened after ligand binding. To this end, we define protein relative rigidity change $\Delta\mu({\rm Com})$ by $\Delta\mu({\rm Com})=\sum_{j}\left[\frac{\mu_{j}({\rm Com})-\mu_{j}({\rm Pro})}{\mu_{j}({\rm Pro})}\right]$ , where $\mu_{j}({\rm Com})$ and $\mu_{j}({\rm Pro})$ are the rigidity indexes of the $j$ th heavy atom of the protein-ligand complex ( ${\rm Com}$ ) and the original protein ( ${\rm Pro}$ ), respectively[21]. Figure 2 depicts the protein relative rigidity change ( $\Delta\mu$ ) against each complex in the PDBBind v2007 core set (N=195) [6]. It is interesting to note that all the relative rigidity changes are positive, which confirms our hypothesis that ligand binding enhances protein rigidity and reduces protein flexibility.

Rigidity index based scoring functions (RI-Score)

Having confirmed that ligand binding enhances protein rigidity, it remains to establish that rigidity strengthening is a vital mechanism for protein-ligand binding. Our hypothesis is that, for blind prediction of protein-ligand binding affinities, if a quantitative model solely based on rigidity analysis is very accurate and more competitive than other state of the art models in the field, it proves that rigidity strengthening is a vital mechanism for protein-ligand binding. To prove our hypothesis, we exclude electrostatic, van der Waals, hydrogen bond, hydrophobic and hydrophilic interactions used in conventional force field based models [2, 8, 9, 10], including ours [20] and consider nothing but protein rigidity change upon ligand binding. We define element-specific protein-ligand rigidity index by collecting cross correlations

[TABLE]

where $\alpha={\rm E,L}$ is a kernel index indicating either the exponential kernel ( ${\rm E}$ ) or Lorentz kernel ( ${\rm L}$ ). Correspondingly, $\beta$ is kernel order index such that $\beta=\kappa$ when $\alpha={\rm E}$ and $\beta=\nu$ when $\alpha={\rm L}$ . We adopt our fast FRI (fFRI) based on the cell lists algorithm [27] with a cutoff distance $c$ to reduce computational complexity [24]. Here, ${\rm X}$ denotes a type of heavy atoms in the protein ( ${\rm Pro}$ ) and ${\rm Y}$ denotes a type of heavy atoms in the ligand ( ${\rm Lig}$ ). Four types of protein heavy atoms, namely $\{{\rm C,N,O,S}\}$ , and nine types of ligand atoms, i.e., $\{{\rm C,N,O,S,P,F,Cl,Br,I}\}$ , are utilized in this work. Unlike force field based methods which require sophisticated data processing, the current model applies directly to x-ray crystallography data.

Results and discussion

The PDBBind v2007 benchmark

For quantitative prediction of protein-ligand binding affinities, we combine protein-ligand rigidity index in Eq. (6) and machine learning to construct RI-Score. Although machine learning can be a powerful approach for modeling massive datasets, its performance depends crucially on its feature vectors. Therefore, if rigidity strengthening is truly the dominant mechanism for protein-ligand binding, the proposed RI-Score should be able to do well in the binding affinity prediction of massive experimental data sets. Although a specific machine learning algorithm, Random Forest, is used in this work, other machine learning techniques, such as gradient boosting algorithms, deliver similar results. To validate RI-Score, we consider a benchmark dataset, PDBBind v2007, to validate our RI-Score against a large number of eminent scoring functions [6, 28, 29]. More specifically, the PDBBind v2007 core set of 195 protein-ligand complexes is utilized as our test set, while our model is trained on the PDBBind v2007 refined set of 1300 protein-ligand complexes, excluding the PDBBind v2007 core set of 195 complexes [6] (i.e., $N=1105$ ).

We first consider exponential kernels which are known for their fast decay and thus can be used to detect the effective length of of short-, medium- and long-range interactions. To this end, we investigate the behavior of high-order exponential kernels with large $\kappa$ values, which are essentially the ideal low pass filter as shown in Fig. 1 and thus are able to exclude all interactions beyond the kernel length scale. Figure 3(a) depicts the Pearson correlation coefficients of four high order exponential kernels over a wide range of scales. It is interesting to find surprising oscillations in the Pearson correlation coefficients over different scales. Such oscillations indicate the inclusion of different numbers of nearest residue layers. For example, when $\tau=1.5$ , $\eta_{\rm CC}$ for a ${\rm C-C}$ pair is 5.1Å. Since one of these two carbons belongs to the protein and the other belongs to the ligand, the only hydrophobic interactions that can be accounted are those from the nearest layer of residues. The next peak occurs at $\tau=3$ (i.e., $\eta_{\rm CC}=10.2$ Å), which is due to the inclusion of the interactions of the nearest two layers of residues. It is amazing to note that these oscillations persist for two more peaks to reach the best predictions at $\tau=5.5$ , which suggests that interactions from the nearest four layers of residues effectively contribute to the protein-ligand binding. To our knowledge, it is very difficult for physics based methods to track such long-range interactions. The present finding has significant ramifications to protein design.

We note that when $\tau=1.5$ , Pearson correlation coefficients obtained by all exponential kernels are already better than those attained by most other methods in the field [6, 28, 29]. This good performance is due to the fact that at $1.5(r_{i}+r_{j})$ , all the hydrogen bonds, most van der Waals interactions, and good portion of electrostatic interactions are included in our RI-Score. It is well known that hydrogen bond interactions can be effective from 2.2 to 4 Å, with donor-acceptor distances of 2.2-2.5Å as strong, mostly covalent, 2.5-3.2Å as moderate, mostly electrostatic, and 3.2-4.0Å as weak, electrostatic [30]. Their corresponding energies are about 40-14, 15-4, and less than 4 kcal/mol, respectively [30].

The impact of cutoff distance as shown in Figure 3(b) also show an oscillatory pattern. In fact, this oscillation is consistent with that shown in Figure 3(a).

Unlike exponential kernels, Lorentz kernels are well known for their slow decay, which is able to capture interactions over a wide range of distances. Figure 4 shows the influence of the power of the Lorentz kernel, the scale, and cutoff distance to the blind prediction accuracy in terms of Pearson correlation coefficients with respect to the experimental binding affinity data. With a sufficiently large cutoff values $c=40$ Å, best prediction is obtained at $\nu=5$ and $\tau=1$ . This result reveals that most important protein-ligand interactions occur approximately at van der Waals distances, which strongly indicates that short range hydrogen bond interactions are relatively more important than long range interactions. When $\nu=5$ and $\tau=1$ , a cutoff value of 20Å, which indeed contains four layers of residues, is found to be large enough to include all the essential protein-ligand interactions.

Protein-ligand interactions intrinsically involve multiple characteristic length scales, such as those for three types of hydrogen bonds, van der Waals bonds, and hydrophobic interactions, which can still be very important for residues far away as indicated above. In our earlier study, we found that multiscale FRI (mFRI) that utilizes two or three FRI kernels parametrized with different length scales can significantly improve the B-factors prediction. In this work, we are interested in examining the impact of multiscale rigidity to binding free energy prediction. Machine learning approach makes it particularly convenient to incorporate multiscale effect in predictive models. One just needs to construct one additional set of features at a desirable scale. To this end, we construct two sets of RI feature vectors, with one of the set parametrized at $\tau_{1}$ and the other at $\tau_{2}$ . In general, we denote these feature vectors by RI ${}^{\alpha_{1}\alpha_{2}}_{\beta_{1},\tau_{1},c_{1};\beta_{2},\tau_{2},c_{2}}$ as a straightforward extension of our notation. To reduce the number of degrees of freedom, values of $\nu(\text{or~{}}\kappa)$ and $c$ are adopted from the corresponding single-kernel model. As a result, we only need to search for two scale parameters $\tau_{1}$ and $\tau_{2}$ . Figure 5 illustrates the impact of two-scale RI feature based RI-Score predictions of the PDBBind v2007 core set of 195 complexes. The median of $R_{p}$ of RILL and RIEE are plotted against different pairs of scale parameters $(\tau_{1},\tau_{2})$ varying from (0.5,0.5) to (20,20). First, it is easy to see that two-scale predictions are typically better than the single scale one. The best predictions are typically achieved at the combination of a relatively small-scale kernel and a relatively large-scale kernel. A scale optimized model, RI ${}^{\rm LL}_{5,1,39;5,12.5,47}$ , delivers the median and the best Pearson correlation coefficient $(R^{m}_{p},R^{b}_{p})=(0.800,0.811)$ . Similarly a scale optimized model, RI ${}^{\rm EE}_{2.5,1,12;2.5,5.5,21}$ , achieves $(R^{m}_{p},R^{b}_{p})=(0.797,0.809)$ . More details are reported in Table SI in Supporting Information. These findings again confirm the importance of incorporating multiscale in the protein-ligand predictions.

Encouraged by the above two-scale results, we have also explored the utility of three-scale RI-Score models the prediction of the PDBBind v2007 core set. Certainly, a complete search of the parameter space is prohibitively expensive. We therefore focus on exploring the scale parameters $\tau_{1}$ and $\tau_{2}$ for certain sets of other parameters. It is found that RILLE is the best model with $(R^{m}_{p},R^{b}_{p})=(0.803,0.825)$ . The corresponding RMSE of our best prediction is 1.99 kcal/mol. More details of predicted energies by RI-Score are provided in the Supporting Information.

Predictions from different scoring functions on PDBBind v2007 core set have been presented in the literature [6, 28, 29]. Figure 6 plots the performance of these scoring functions along with our prediction, i.e., RILLE (highlighted with red color). Clearly, RI-Score outperforms all the other eminent scoring functions. The correlation between experimental binding free energies and the best prediction attained by RILLE is also presented in the figure.

The PDBBind v2013 benchmark

It remains to show that the outstanding performance of proposed RI-Score is not limited a specific data set. To this end, we consider PDBBind v2013 core set of 195 protein-ligand complexes as our test set [32]. The PDBBind v2015 refined set of 3706 protein-ligand complexes, excluding the PDBBind v2013 core set, is employed as our training set. Our best model for the PDBBind v2013 core set which is the same as PDBBind v2015 core set, is given by RI ${}^{\rm LLE}_{6,4,15;6,10,37;3.5,1.5,8}$ with a Pearson correlation coefficient and RMSE of 0.782 and 2.051 kcal/mol, respectively. More details of RI-Score predictions are provided in Table S3 of the Supporting Information.

PDBBind v2013 core set is also a popular test set for benchmarking scoring functions. For comparison, Fig. S8 plots the performance of our method and 25 other scoring functions on this benchmark with data taken from Refs. [32]. Again, RI-Score is found to be significantly more accurate than other eminent methods.

Conclusion

Protein-ligand binding is of paramount importance to biomolecular functions and biomedicine. The molecular mechanism of protein-ligand binding remains an active research topic, despite of enormous effort in the past few decades. The present work qualitatively demonstrates that protein-ligand binding gives rise to protein rigidity enhancement. Further study shows that rigidity index alone is able to quantitatively describe protein-ligand binding affinities. The fact that rigidity index based scoring function, RI-Score, offers the best binding affinity prediction over two benchmark data sets indicates that rigidity strengthening is a dominant mechanism in protein-ligand binding. Therefore, protein flexibility reduction drives protein-ligand binding. It is well known that protein flexibility is associated with many protein functions. The present finding strongly suggests that agonist protein receptor binding inhibits protein functions through protein flexibility reduction. Additionally, a significant correlation of the nearest four layers of residues to protein-ligand binding affinities is discovered, which has a nontrivial ramification to drug and protein design.

Supporting Information Available SI_RI-score.pdf provides discussion of dataset preparation, and some additional results for PDBBind v2007 core set and PDBBind v2013 core set.

Acknowledgments

This work was supported in part by NSF grants IIS-1302285 and DMS-1160352 and MSU Center for Mathematical Molecular Biosciences Initiative.

Bibliography32

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] M. K. Gilson, J. A. Given, B. L. Bush, and J. A. Mc Cammon, “The statistical-thermodynamic basis for computation of binding affinities: a critical review.,” Biophysical journal , vol. 72, no. 3, p. 1047, 1997.
2[2] M. K. Gilson and H. X. Zhou, “ Calculation of protein-ligand binding affinities,” Annual Review of Biophysics and Biomolecular Structur , vol. 36, pp. 21–42, 2007.
3[3] M. I. Zavodszky and L. A. Kuhn, “Side-chain flexibility in protein–ligand binding: The minimal rotation hypothesis,” Protein Science , vol. 14, no. 4, pp. 1104–1114, 2005.
4[4] P. Tuffery and P. Derreumaux, “Flexibility and binding affinity in protein–ligand, protein–protein and multi-component protein interactions: limitations of current computational approaches,” Journal of The Royal Society Interface , vol. 9, no. 66, pp. 20–33, 2012.
5[5] R. Grünberg, M. Nilges, and J. Leckner, “Flexibility and conformational entropy in protein-protein binding,” Structure , vol. 14, no. 4, pp. 683–693, 2006.
6[6] T. Cheng, X. Li, Y. Li, Z. Liu, and R. Wang, “Comparative assesment of scoring functions on a diverse test set,” J. Chem. Inf. Model. , vol. 49, pp. 1079–1093, 2009.
7[7] J. Liu and R. Wang, “Clasification of current scoring functions,” Journal of Chemical Information and Model , vol. 55, no. 3, pp. 475–482, 2015.
8[8] P. A. Kollman, I. Massova, C. Reyes, B. Kuhn, S. Huo, L. Chong, M. Lee, T. Lee, Y. Duan, W. Wang, O. Donini, P. Cieplak, J. Srinivasan, D. A. Case, and T. E. Cheatham, “Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models.,” Acc. Chem. Res , vol. 33, pp. 889–897, 2000.