Sampling Conformational Ensembles of Highly Dynamic Proteins via Generative Deep Learning
Talant Ruzmetov, Ta I Hung, Saisri Padmaja Jonnalagedda, Si-han Chen, Parisa Fasihianifard, Zhefeng Guo, Bir Bhanu, Chia-en A. Chang

TL;DR
This paper introduces a deep learning model to sample protein conformations, revealing new insights into dynamic proteins like Aβ42.
Contribution
A novel unsupervised deep learning model, ICoN, is developed to sample conformational ensembles of dynamic proteins.
Findings
ICoN learns physical principles of conformational changes from MD simulations.
The model generates novel conformations with detailed sidechain and backbone arrangements.
Synthetic conformations of Aβ42 revealed clusters that align with experimental data.
Abstract
Proteins are inherently dynamic, and their conformational ensembles are functionally important in biology. Large-scale motions may govern protein structure–function relationship, and numerous transient but stable conformations of intrinsically disordered proteins (IDPs) can play a crucial role in biological function. Investigating conformational ensembles to understand regulations and disease-related aggregations of IDPs is challenging both experimentally and computationally. In this paper first an unsupervised deep learning-based model, termed Internal Coordinate Net (ICoN), is developed that learns the physical principles of conformational changes from molecular dynamics (MD) simulation data. Second, interpolating data points in the learned latent space are selected that rapidly identify novel synthetic conformations with sophisticated and large-scale sidechains and backbone…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsProtein Structure and Dynamics · Machine Learning in Materials Science · Metabolomics and Mass Spectrometry Studies
Introduction
Proteins are complex and have dynamic properties. Conformational ensembles of proteins are essential for performing biological processes, including enzyme activities, protein folding, protein–ligand binding, and protein aggregation^1,2^. Characterization of protein conformations allows researchers to understand protein function, activity, and mechanisms. However, the tasks may be daunting and are even more challenging for investigating intrinsically disordered proteins (IDPs), for which numerous conformations can be functionally important. Some IDPs, such as amyloid-β_1–42_ (Aβ42), are related to a number of protein aggregation-related diseases, including Alzheimer’s disease^3^. Revealing important conformations with atomistic details is crucial to fully understand the function of a protein and its regulation involved in basic science, therapeutics design and biotechnology.
Experimental techniques determine protein structures and probe their dynamics, and bioinformatics tools such as AlphaFold2 and RoseTTAFold can accurately predict structures from given protein sequences^4–6^. However, these techniques typically provide a handful of structures with limited information on dynamics. Advanced computational methods are widely used to investigate protein dynamics and sample protein conformational ensembles. A powerful method is all-atom molecular dynamic (MD) simulations, which is based on physical principals to sample protein conformations and model their dynamics^7,8^. However, many stable protein conformations are separated by free energy barriers: an MD run can take excessively long computational time (i.e., beyond microsecond timescale) to cross energy barriers for sampling various conformations. Other methods developed to tackle sampling limitations include conformational search methods, MD-based enhanced sampling techniques, and Monte Carlo simulations^6,9–12^. A popular analysis tool, principal component analysis (PCA), has been used to extract information from sampled protein conformations to guide conformational search^13,14^. Low-mode based conformational search methods use normal mode analysis (NMA) to perturb given structures along their low-mode eigenvectors to generate new conformations^15–18^. Although these sampling approaches effectively extend the time scale of MD simulations, efficiently obtaining adequate conformations for highly flexible protein systems and/or overcoming significant energy barriers is still challenging.
Artificial intelligence (AI) provides an alternative approach for accelerating the generation of protein conformational ensembles. Novel machine learning/deep learning (ML/DL) approaches analyze simulation results to further guide conformation search with enhanced sampling techniques^19–27^. ML/DL optimizes coarse-grain models to speed up conformation transitions with preserved atomistic interactions^28,29^. Several DL models accelerate protein conformational searches, such as training neural networks to model the distribution of conformational ensembles with generative modeling^30–39^. By training with conformations obtained from MD simulations, these approaches can quickly generate conformational ensembles and bypass the kinetic barriers, the bottleneck of MD simulation.
Generative models study conformations of chemical compounds and proteins in various settings; examples include coarse-grained models and protein conformations presented by a backbone^28,40^. Ideally, the generative models can find new and thermodynamically stable protein conformational ensembles not seen in the training dataset obtained from MD simulations, as demonstrated in recent publications focusing on sampling conformations for IDPs^41–45^. Nevertheless, proteins have complex conformational ensembles, and many conformational transitions are driven by backbone motions and, more importantly, sidechain interactions such as salt bridges, hydrogen bonds, and hydrophobicity. As a result, to find new protein conformations, features that can accurately describe all possible motions in atomistic details are critical in generative AI models.
In this study, we present a generative DL model, termed Internal Coordinate Net (ICoN), which is trained on datasets from MD simulations to identify new conformations efficiently and accurately for highly flexible protein systems. ICoN uses novel features and an atomistic bond-angle-torsion (BAT)-based vector representation, vBAT, to smoothly rotate various dihedral angles that lead to new conformations. Our model learns the physical principles that govern molecular motions, and the 3D latent space contains the transformations of various degrees of freedom (DOF), mainly dihedral rotations, to allow for efficient conformational searches. Essential protein motions analyzed by dihedral Principal Component Analysis (dPCA) are clearly revealed in the latent space, and generating new conformations is achieved by data-point interpolation within the latent space. Using a fraction (1%) of MD data as a training set, we demonstrated that ICoN rapidly found thousands of new, conformationally distinct and thermodynamically stable conformations for αB-crystallin57–69 and the Aβ42 monomer (Figure 1) in a few minutes using a single gaming GPU card. Post-analysis of our novel synthetic Aβ42 monomer conformations revealed various new conformations, including new Arg5-Ala42 contacts and a new Asp23-Lys28 salt bridge not seen in the MD training set but reported in existing publications. Our synthetic conformations covered reported structures that were suggested to be less toxic or more prone to oligomerization^46–52^. Our model provides a computationally efficient approach to identify new conformations for highly flexible protein systems. The novel synthetic Aβ42 monomer conformations also provide insights into Aβ42 oligomerization.
Results
ICoN model architecture, training, and validation.
Molecular representation using the vBAT coordinate.
The ICoN model is a latent space-guided generative AI model trained on protein conformational ensembles to learn physical properties that govern conformational transitions in MD simulations (Figure 2). Because the dihedral rotations are the major motions to determine conformations, using BAT to accurately describe the concerted motions of multiple dihedral rotations is critical. Dihedral rotations have a periodicity issue, so we used a vector called vBAT to avoid periodicity (Figure 2B and details in Supplementary Material). All-atom vBAT internal coordinate representation is inherently equivariant to rotations and translations and can exactly convert vBAT vectors to Cartesian coordinates without any additional approximation. However, coordinate conversions can be time-consuming. Thus, our model implemented GPU acceleration to perform coordinate transformations; converting 5,000 frames from vBAT to Cartesian coordinates took < 0.2 s for Aβ42. Figure S2 demonstrates the efficiency in coordinate transformations that successfully avoided a computation bottleneck with the use of internal coordinates.
ICoN network architecture and training.
The neural network is based on autoencoder architecture (Figure 2A). Classical BAT coordinates (the Z matrix) have 3N-6 DOF for presenting the internal motions of a molecule, where N is the number of atoms, and the 6 external translation and rotation DOF are eliminated. Because vBAT uses the vectors instead of one value for each bond, angle and dihedral DOF, we have 4×3x(N-3) features for each molecule. As a result, αB-crystallin57–69 and Aβ42 have 2376 and 7488 features, respectively. The hyperparameter training in the encoder resulted in 7 layers to reduce a large dimension representation into 3D, so each molecular conformation can be presented in the 3D latent space (Figures 2A and S3). The decoding process brought the representation from 3D back to the original dimensions with the same number of layers but in reverse order. Each conformation is converted from vBAT to Cartesian coordinates for further analysis. Throughout the iterative training process, in addition to learning a relationship between a vast number of features, the network also learns how to meaningfully compress the representation into the lower-dimensional 3D latent space by performing nonlinear dimensionality reduction.
Notably, although this study used 3D latent space for easy observation of conformation distribution and visualization of interpolation, the network can reduce 4×3x(N-3) features to any dimension desired by users. We used 3D to achieve efficient training and reduced memory consumption as well. It took 4 min with one NVIDIA 1080Ti GPU card to perform 15000 epochs for successful training with 10,000 frames of Aβ42.
Validation of the models by conformation reconstruction.
A successful trained model should accurately reconstruct a protein conformation from the reduced 3D representation to the original conformation. To evaluate the accuracy, we presented conformations from our validation set using the 3D representation and then reconstructed them back to atomistic Cartesian representation and computed the root mean square deviation (RMSD) of all heavy atoms between the 2 conformations, the original and reconstructed ones. A few representative conformations are illustrated in Figure 3: αB-crystallin57–69 has RMSD < 0.9Å and Aβ42 has RMSD < 1.3Å. As shown in Table 1, with our model, both backbone and sidechains conformations can be accurately reconstructed back to the original input structure, ambient after large dimensionality reduction in the latent space. We also performed a more derailed comparisons between the original and reconstructed conformations, such as the distribution of each dihedral rotation and their correlations, to ensure that all the properties were accurately reproduced (Figures S4–S7). Notably, both αB-crystallin57–69 and Aβ42 exhibit large-scale protein motions included in our training set. The validation demonstrates the robustness of our model to learn a diverse range of conformational states with high precision, and we can describe small sidechain rotations accurately.
Learned physics in the latent space of αB-cristallin57–69
Because our ICoN model learns important dihedral rotations that determine molecular conformations, the latent space stores information that leads to smooth conformational transitions between 2 datapoints, such as various sets of concerted dihedral rotations. Moreover, we can visualize conformation transitions and clusters and can perform data analysis in the 3D latent space (Figures 4 and S9). We demonstrated that the DL model learned the physics governing molecular motions, and the information captured in the latent space can be effectively utilized.
Essential motions of αB-cristallin57–69 revealed in the latent space.
We selected 2 conformations in the latent space, Conf indexes #50 and #53, from Figures 4 to examine the conformational transitions in the latent space. Note that the MD saved one frame every 1 ps, and we re-saved a frame every 100 ps when preparing our training and validation sets. Therefore, Conf indexes #50 and #53 (2 red dots in Figure 4D) are significantly different; they present 2 end points from 300-ps time-period MD. To illustrate the conformation transition, we superimposed 4 conformations (indexes #50 to #53 Figure 4A) and found that the first 6 residues were maintained in a helix secondary structure, and the middle region around Leu9 showed minor but important fluctuations to twist the center of the peptide, whereas the flexible tail (residues 7 to 13) exhibited greater dynamics. A correlated sidechain movement (i.e., sidechains of Leu9 and Met12) was observed during the transitions as well.
In the latent space, we found that the 300-ps interval MD trajectory sampled conformational jiggling during the transitions (light blue dots in Figure 4D). To extract the essential motions from the conformation fluctuation, we applied PCA with BAT coordinates to analyze the major motions from the 300-ps time-period MD to quantify the motions. The first Principal Component (PC) revealed the major motions, which removed unessential fluctuations (Figure 4B). Presenting these essential motions in the latent space resulted in a very smooth curve (green dots in Figure 4D). Note that in the latent space, the PC motions appeared as a non-linear curve, not a straight line. Figure S9C exhibits more smooth curves for motions following the second PC and the first + second PC modes, so the latent space recorded similar information obtained from PCA. As for PCA, our ICoN learned the important DOF that led to conformational transition (i.e., various sets of concerted dihedral rotations), and the latent space retained the information from DL. Therefore, interpolation between 2 dots in the latent space can efficiently sample protein motions that are led by movements from critical DOFs. As illustrated in Figure 4, the interpolation (black dots in Figure 4D) revealed the learned motions, which also showed a more rigid helix secondary structure in the first 6 residues and the fluctuations in the middle region around Leu9 to result in more distinct conformations with smooth transitions (Figure 4C). Importantly, our model accurately identified the concerted motions shown in both MD and PCA. These essential motions were in good agreement with those detected by dihedral PCA and our observation of the MD trajectory. Of note, our vBAT and model can precisely learn and present very small conformational changes, such as the methyl rotation of Met12.
Interpolation-generated synthetic conformations of αB-cristallin57–69.
Because ICoN identified DOF that determine conformational changes, we used the latent space interpolation as a conformational search engine to find more synthetic conformations. The search was achieved by interpolation between 2 dots with consecutive indexes (i.e., Conf indexes #10 and #11). Following the essential motions, we anticipated finding both new conformations and existing ones sampled by MD. Using the same non-linear interpolation function illustrated in Figure 4D, 10 points were generated for each pair, yielding 99,990 synthetic conformations from interpolating a total of 9,999 pairs of dots. We first eliminated high-energy conformations, then each new synthetic conformation was compared with its predecessors and eliminated if it was a repeat, yielding 43,484 distinct conformations (Table 1). We further compared these synthetic conformations with the raw MD data with 500,000 frames to identify 38,425 novel synthetic conformations.
We also examined the stability of the synthetic conformations quantified by their conformational energies using the molecular mechanics/generalized Born and surface areas (MM/GBSA) calculations. Conformations sampled from MD and our deep learning model have similar energy distribution (Figure 5A), which validates our synthetic conformations as being thermodynamically stable. In addition, we also compared our synthetic conformations with conformations obtained by another MD run (Table 1, Seed 2); none of the conformations from this MD was used in our training set. Our interpolation sampling found similar conformations found by another MD run Figure 5B, which demonstrates the efficiency and accuracy of the use of the latent space for conformational sampling.
Latent space interpolation-generated synthetic conformations of Aβ42
The Aβ42 monomer exhibits a broad spectrum of conformations, from random coil to more structured α-helical and β-sheet conformations. Sampling strategies that can accurately model sidechain motions are crucial because the sidechain arrangements govern the intramolecular attractions that lead to various local turns and pre-organized shapes for subsequent oligomerization. Using MD Runs 1 to 4 (Table 1), we used the same strategy used for αB-cristallin57–69 to interpolate 2 consecutive dots in the latent space for obtaining synthetic conformations. The 4 MD runs were initiated by a different Aβ42 monomer structure and/or with a different force field. To elucidate the search efficacy, we plotted the distribution with coordinates of Rg and RMSD (Figure 6) for MD Run1–4 and our synthetic conformations from Model 1–4 using these MD runs for training. The search using generative AI efficiently found many new conformations without the need for lengthy MD simulations (Table 1).
To validate that our synthetic conformations are thermodynamically stable, we again plotted energy distribution of synthetic conformations found in Model 1 (Figure 5C). We also compared the synthetic conformations to another 3 MD runs using the same initial structure and IDP force field (MD Run1.1–1.3) to examine whether our search could find those sampled by other MD simulations. Although the Aβ42 monomer has numerous conformations, we could identify similar conformations (Figure 5C). In addition to checking the total conformational energy, we compared each energy component to verify that both bonded and non-bonded terms have energy distribution similar to those modeled by MD runs (Figure S10 and S11). Notably, the energy distribution from the synthetic conformations is relatively broader, thereby suggesting a greater diversity in the sampled conformations, which cover high and low energy regions. The energy comparison underscores the high quality of structural integrity of the synthetic conformations.
Analysis and biological implications of novel synthetic Aβ42 conformation ensembles
Obtaining Aβ42 monomer conformations is a critical step for understanding the mechanisms of initial encounters and interactions between monomers, which lead to subsequent oligomerization and fibrillization. Because the oligomerization steps can be highly sensitive to different environments (e.g., different membrane, existing fibrils, or ion concentrations), substantially different monomer conformations may initiate aggregation using different mechanisms in various environments. Several experiments and modeling work also suggest that some conformations are prone to be aggregated or non-toxic^46–50^. Because of various experimental results regarding the monomer structure–function relationship in Aβ42 aggregation, we focus on structures with salt bridges and R5-A42 contacts to demonstrate the utility of the conformational search results from our generative AI model.
Of note, because our method efficiently sampled numerous low-energy monomer conformations, we used synthetic conformations found using Model 1 (Table 1) to demonstrate the biological importance of these conformations revealed by our ICoN model.
Major conformation turns in a global tertiary structural ensemble.
The 127,673 distinct conformations (Table 1) are presented in the pairwise residue contact map (Figure 7A). Four local bends are marked as turn A (F4-H6), turn B (E11-H14), turn C (S26-K28), and turn D (V36-G38). We found novel synthetic conformations with all 4 turns (Figure 1D). Turns A and B are located near the N-terminal, which is highly flexible and believed to be the metal binding region under abnormal physiological conditions (residues D1 to K16)^53^. Turn C situates close to a central polar region that contains many hydrophilic residues (E22 to G29). Turn D resides in the C-terminal hydrophobic region (A30 to A42). These turns directly correlate with the local and/or overall Aβ42 conformations, where some turns are commonly reported. For example, turn regions spanning the center residues (K16-G25), in particular turn C referenced as “β-turn-β” motif, have been reported in Aβ42 aggregates^54,55^. Although the conformations/turns and their population are obtained using only Model 1, they provide an overview of spatial arrangements of the Aβ42 monomer. Notably, although Aβ42 is intrinsically disordered, the protein sequence still can lead to many low-energy and highly popular conformations. Turn A locates in the highly flexible N-terminus, where mutation substitutions of Arg5 suppressed the aggregation of Aβ42^47^. Our EPR measurements also showed distinct spectral features at residue 5, indicating that Arg5 may feature a distinct structure or interactions as compared with the nearby residues (Figure 8). The novel synthetic conformations revealed that Arg5 can form interactions with Ala42 with or without the presence of Turn A, as illustrated in Figures 7B and 7C, respectively. Of note, turn D (V36-G38) is present alongside, which may help to orient and stabilize the C-terminus to form an interaction with Arg5 (red circle in Figures 7B–C). Our previous work showed that substituting a nitroxide spin label compound R1 at positions G37 and G38 altered the kinetics of Aβ42 fibril formation^56^, and these synthetic conformations provide a possible monomer structure to guide future experimental design.
Presence of salt bridges in toxic and less toxic monomer conformations.
Salt bridges play an important role in providing intra-molecular attractions and conformational specificity that directly relate to Aβ42 aggregation. This non-covalent interaction forms when sidechains of oppositely charged residues are close enough to each other to experience electrostatic attractions. Therefore, protein structures must be described with precise atomistic details and sidechain motions sampled accurately. Although the D23-K28 salt bridge did not exist in the 1-μs MD Run1 used in our training set, by using interpolation in the latent space, the model sampled novel synthetic conformations with a new salt bridge, D23-K28 (Figure 9), which is reported to promote aggregation^46^. Experiments suggested the importance of this salt bridge, but the conformation ensemble was never determined experimentally. The novel synthetic conformations revealed distinctly different rearrangements with the D23-K28 salt bridge, ranging from more extended to highly packed conformations (Figure 9). The D23-K28 constraint results in Turn C, which is important to oligomerization.
The synthetic conformations revealed novel conformations with the salt bridge E22-K28 with significantly different conformations (Figures 9D–F). This salt bridge was reported to prevent a toxic turn at positions E22 and D23, for potentially a less-toxic monomer^57^. The synthetic conformation presents key features suggested by experiments, with no turn at positions 22 and 23 when the salt bridge E22-K28 exists (Figure 9). Nevertheless, other publications suggested that the salt bridge E22-K28 may help form other intramolecular contacts (e.g., contacts between the C-terminus and Arg5) to stabilize a hairpin structure and promote dimer formation^47^. Although which salt bridges promote or inhibit monomer aggregation is debatable, our novel synthetic conformations provide various possible sidechain rearrangements and salt-bridge conformations for further interpreting experiments in understanding aggregation mechanisms.
Discussion
In this work, we demonstrated the ability of the DL model ICoN, which applies learned physics of protein motions, to directly generate new protein conformational ensembles. The transferable conformational search approach applied a neural network model using protein conformations generated by classical MD simulations with all-atom representation in explicit solvent model to train an autoencoder. Here we re-saved frames from MD trajectories without further processing the data. We reduced protein atomic coordinates to a 3D latent space, with each conformation represented by one data point in the 3D plot. Each data point can be accurately reconstructed back to a protein structure as well. ICoN uses interpolation between 2 data points in the 3D latent space to guide the search for conformational changes, with natural molecular motions identified by the ICoN model. The BAT-based (vBAT) feature overcomes problems with Cartesian coordinates that fail to smoothly present dihedral rotation during conformation sampling, which is particularly important in sampling large-scale motions such as those shown in IDPs. vBAT also effectively handles dihedral periodicity. The model is highly efficient for identifying novel synthetic Aβ42 conformation ensembles not seen in the MD trajectory used for training. These ensembles have different large-scale motions (i.e., new compact conformations) and detailed sidechain rearrangements such a forming a new salt bridge, D23-K28.
The synthetic Aβ42 conformations sampled by the ICoN model provide a comprehensive snapshot of the conformational landscape for Aβ42 monomers. For example, the ICoN model finds the formation of salt bridges involving residues E22 or D23 (Figure 8), both of which are hotspots for familial mutations in Alzheimer’s disease. The ICoN model also identifies four local bends at residue positions 4–6, 11–14, 26–28, and 36–38 (Figure 1D). Although each of these turns have been previously observed in isolation, the observation of all four turns in Aβ42 monomers suggests that the formation of these turns may precede the assembly of specific oligomeric or fibrillar structures. The synthetic conformations containing these turns provide a foundation for investigating how these turns affect the aggregation pathway or formation of specific aggregate structures.
Training of the ICoN model requires sufficient protein conformations, so the DL can reveal natural motions of the protein to guide the search toward new low-energy structures. Because experimentally determined structures typically cover only a small proportion of the overall conformation ensembles, we need to rely on MD simulations. However, force-field parameters used in MD may affect the performance in obtaining protein conformations and dynamics. Because conformation ensembles with artificial bias in the training set will affect the performance of ICoN, using a suitable force field for MD simulations is crucial. For example, conventional force fields parameterized using structured proteins can over-stabilize disordered proteins/regions, such as generating more helical structures in Aβ42 when simulated by Amber ff14SB, as compared with diverse disordered Aβ42 structures simulated by an IDP-specific force field, ff14IDPSFF^58^. The model trained by MD runs using a less flexible ff14SB affected the search results and yielded fewer novel synthetic conformations (Table 1). Note that our training set did not need to cover the complete conformational distribution. However, conformation fluctuation governed by key dihedral rotations should be included in the training set so that the model can identify these key elements for guiding the conformational search to generate novel synthetic conformations.
Conceptually, ICoN performs as a conformational search method rather than a deterministic simulation method. Therefore, the search results do not directly reflect the equilibrium distributions of protein conformations. However, protein conformations are populated according to a Boltzmann distribution, which can be approximated by their computed energy. Unlike structured proteins, which have highly stable global energy minima, IDPs have numerous fluctuating heterogeneous conformations with similar energy. The IDPs can be highly sensitive to their environment, and precisely reproducing physiological conditions when performing experiments or simulation is impractical. Therefore, thoroughly finding IDP conformational ensembles is more practical and useful instead of aiming to obtain the population distribution in equilibrium under a particular physiological environment. Importantly, unlike structured proteins whose native structures are functionally important, meta-stable conformations of IDPs can be crucial in performing the biological function. For example, experiments showed that Aβ42 monomer aggregation has a rate-limiting nucleation-dependent polymerization process^50^, which suggests that the highly populated monomer conformations might not be the ones that drive nucleation, and the monomers are not pre-organized as a ready-to-aggregate conformation. Therefore, as compared with the conformational search for structure proteins, thoroughly sampling meta-stable conformations for IDPs is critical. As demonstrated in this study, the novel synthetic Aβ42 monomer conformations provide atomistic details to support several experimental observations, which also bring insights into mechanisms of oligomerization.
Conformational changes of biomolecules follow principles of physics, which are consequences of different arrangements of atoms rotating around a set of single bonds, the internal degrees of freedom having the highest flexibility. Both PCA and NMA describe the natural motions of a molecule, which are commonly used to determine essential protein dynamics and guide conformational sampling. The two techniques are linear transforms that extract the most important elements in a data matrix, a covariance or Hessian matrix. If the conformations are described by BAT coordinates, concerted dihedral rotations are directly identified by the eigenvectors: new conformations can be generated via distortion using each eigenvector and their combinations. However, real protein systems can have very complicated and higher-order correlations that have intrinsic nonlinear effects and may not be well described using standard PCA or NMA. Moreover, although PC space is commonly used for data analysis, there is no function to convert manipulated data points in PC space back to its atomistic protein structure. The activation function used in the neural network adds nonlinear effects to process nonlinear features, which allows for the autoencoder to use a reduced dimension to describe protein conformations and accurately convert the perturbed reduced-dimension coordinates back to their atomistic structures. A transition between two protein conformations is governed by certain concerted rotating dihedral angles, which can be achieved by interpolating two data points (conformations) in the latent space. As illustrated in Figure 4, if one obtains conformations along the smooth green curve in the latent space, the conformational changes reproduce the motions following the first PC mode. Notably, various arrangements can lead to the same conformational transitions, which can be revealed by interpolating between data points in the latent space. As for the natural motions identified by PCA or NMA, these conformations follow natural motions of the protein system (black dot line in Figure 4), and new conformations can be found by processing the interpolations in the latent space.
Although the ICoN model is fast, taking a few minutes to train 10,000 frames of Aβ42 and hours to perform quick energy minimization for the synthetic conformations generated from latent space interpolation using a GPU card, the training set is from a physics-based MD sampling. It typically takes days, if not weeks, to perform sufficiently long MD simulations to obtain the training set. To speed up the sampling, one can apply the ICoN model for preliminary results to select dissimilar protein conformations to seed more MD runs^19^. An advantage in using BAT-based coordinates is that one can easily select a set of dihedral angles that determine the motions of interest for training, instead of using all DOF. For small proteins or peptides such as Aβ42 and αB-crystallin57–69, our classical internal BAT (Z-matrix)-based vBAT nicely presents local sidechain rearrangements and large-scale backbone motions. However, for larger (>200 residues) proteins, rotations of some dihedral angles near the root atoms may result in unrealistic motions on the far end of the protein. The issue can be addressed easily by using multilayer BAT coordinates, assigning multiple fragments/chains for a protein or multi-protein complex. Also, reconnecting fragments with pseudo-DOF can avoid the accumulated dependence problem^59^. Future work will implement the multilayer BAT coordinates to eliminate the protein size limitation when building the ICoN model. Various interpolation and extrapolation strategies in the latent space will be examined for generating novel synthetic conformations as well.
Methods
All-atom Molecular Dynamics Simulations and Preparation of Training and Validation Data Sets
All MD simulations were performed using the AMBER20 package using either ff14SB or ff14IDPSFF force fields (Table 1) with GPU acceleration^60^. The systems were simulated using TIP3P explicit solvent model^61^ at temperature of 298K with NPT ensemble. 12 Å cutoff was used for short range non-bonded interactions and the long-range electrostatic interactions were computed by the particle mesh Ewald method (PME)^62^.
Aβ42
MD-Run1 was initiated using PDB: 2NAO with ff14IDPSFF force field (Table 1). Each frame was saved at 1-ps time integral which made up a total of 1,000,000 frames. Other MD trajectories using either PDB 2NAO or 1Z0Q as the initial structure were obtained from our previous work and detailed in Table 1^51,52,63^. Conformations of MD-Run1 to 4 were used for training, validation, and sampling conformations from latent space, while MD-Run1.1 to 3 were utilized for pairwise RMSD search for synthetic conformations obtained by Model 1 (Table 1).
aβ-crystallin57–69
AlphaFold2 computed structure of aβ-crystallin57–69 (PDB: AF_AFP02511F1) was used as initial structure for MD simulations. Two randomly seeded 500-ns MD simulations were obtained from our previous work, where each frame was collected at 1-ps interval which made up 500,000 frames^64^. Conformations of Seed1 (Table 1) were used for training, validation, and sampling conformations from latent space, while Seed2 was utilized for pairwise RMSD search for synthetic conformations.
Preparation of training and validation data sets
To construct our dataset for ICoN model training, we resaved 1 frame every 100-ps starting from 0-ps of Aβ42 MD-Run1 and 2 (10,000 conformations; 1% of raw MD) to prepare 2 training sets. Similarly, we resaved 1 frame every 100-ps starting from frame 50-ps of Aβ42 MD-Run1 and 2 for 2 validation sets. For MD-Run 3 and 4, we resaved a frame every 200 ps (0.5% of raw MD) because the classical ff14SB force field yielded less diverse conformations with favored helicity during simulations^58^ (see Ref #58 for details). As a result, we resaved a frame starting from frame 0 and frame 100 of MD-Run 3 and 4 to prepare 5,000 frames for training and validation, respectively. Similar approach is used for aβ-crystallin57–69 Seed1, where 1 frame every 50-ps starting from 0-ps (5,000 conformations; 1% of raw MD) used for training, and 1 frame every 50-ps starting from 25-ps used for validation. This ensures training and validation datasets are separated by 50-ps and 25-ps time interval for Aβ42 and aβ-crystallin57–69 respectively. Details of other training and validation datasets are shown in Table 1. This approach allows training a model from minimal amount of input data, yet generate numerous conformations with extended conformational space.
Protein Structure Representation
Classical Bond-Angle-Torsion coordinates
Bond Angle Torsion (BAT) is the internal coordinate representation^65–67^, where “bond” denotes the distance between a pair of bonded atoms, “angle” refers to the bond angle between pairs of bonds connected to a central atom, and “torsion” indicates the dihedral angle formed by four bonded atoms via three bonds. Specifically, the dihedral angle is defined as the angle between the plane containing atoms (i, j, k) and another plane containing atoms (j, k, l) (Figure 1E). BAT coordinates can be accurately transformed back to all atom Cartesian coordinates. In this, the placement of each atom i > 3 that is not bounded to atom 2 is specified by its bond length (b_i_), bond angle (α_i_), and dihedral angle (θ_i_) with respect to three other atoms that are bonded in sequence and whose positions are already defined (details in SI). Three terminal atoms, designated as root-based atoms, are used to initiate conversion from BAT to Cartesian. Root based atoms carry six external degrees of freedom such as global translations and rotations.
Vector Representation of BAT (vBAT)
Although BAT coordinates effectively capture concerted motions, they suffer from a periodicity problem. Dihedral rotation from 179° to −179° is actually a 2° shift in angular space. However, the basic arithmetic operation used in computing a covariance matrix results in a large, 179°-(−179°) = 358°, rotation^68^. Hence, we used vector representation of BAT coordinates termed vBAT as input features to ICoN model. The vBAT internal coordinate representation is equivariant under global translations and rotations, and thus provided strong inductive bias for our ICoN model. Furthermore, compared to other structural representations such as Anchored Cartesian^41^ or PCA based representations^32^, internal coordinate representation is independent of reference structure. A detailed description of vBAT computation is provided in this section and in Supplementary Information.
For a set of four bonded atoms with atom indexes i, j, k, and l (Figure 2B), we defined three bond vectors v_ij_, v_jk_, and v_kl,_ that are the relative positions between the atoms. The unit vectors that are normal to the plane made by a pair of neighboring bond vectors defined vector features, denoted as and .
vBAT vector features comprising , , , and were used as an input to ICoN model. When sampling conformations in latent space, all external degrees of freedom (root-based coordinates) were inferred from the reference conformation. Sampled features , , , and were first converted to bond angles, and torsions, and then transformed back to Cartesian coordinates in large batches using a highly optimized GPU implementation (Figure S2). Since changes in bond length are very minimal throughout MD simulations, we used fixed bond length inferring from the reference structure for all generated conformations.
Model Architecture and Training
Model Architecture
The ICoN model is an autoencoder type architecture based on fully connected neural network (Figure 2A) trained on MD data to directly output all atom protein conformations. The model is designed to compress an input representation into lower dimensional latent space (3D) with an encoder that can be directly visualized. These points representing conformations in 3D latent space are decompressed back to the original input dimension with decoder.
Initially, vBAT vectors were flattened followed by concatenation into a high dimensional array to use as input features to the ICoN model. As a result, we have 4×3x(N-3) features for each molecule 2376, and 7488 features for αB-crystallin57–69 and aβ42 respectively. The activation units were gradually reduced to 3-dimensional latent space in seven layers, where each molecular conformation can be presented in the 3D latent space, as illustrated in Figure 2A. For both systems, the encoding process reduced the dimension from n_f_, to n_f_/4, n_f_/8, n_f_/16, n_f_/32, n_f_/64, n_f_/64, and 3, where n_f_ is the total number of input features. Then, the decoding process brought the representation from 3D to 2376 dimension with the same number of layers but reverse the order.
The LeakyReLU activation function was applied to all layers except the last layer. Layer normalization is applied in the second layer. To prevent overfitting, 10% of the weights were randomly set to zero in the third layer, which led to better convergence of validation loss. The decoder takes the latent vector as input and decompresses it back to its original feature dimension. The decoder also featured LeakyReLU nonlinearity but did not include any layer normalization or dropout layers. The bias units are omitted aiming to minimize the model’s parameter count. Meanwhile, all model parameters were initialized through the Xavier Uniform weight initialization method, which is known to enhance training stability and efficiency^69^. The total count of model parameters utilized is 3,294,964 for αB-crystallin57–69 and 32,721,156 for aβ42. Throughout the iterative training process, in addition to learning a relationship between vast number of features, the network also learns how to meaningfully compress the representation into the lower dimensional 3D latent space by performing nonlinear dimensionality reduction. Each conformation is converted from vBAT to Cartesian for further analysis.
Loss Function
The combination of L1 and L2 loss functions, the Smooth L1 loss, was used to minimize the difference between the model output (y_i_) and the original input (x_i_). The equation for the smooth L1 loss function, L, is given below (Eq. 2):
where, (x_i_-y_i_)^2^ is L2 loss and |x_i_-y_i_| is L1 loss.
The L1 and L2 functions intersect at β=1.0 resulting in a smooth, continuous loss function that is less sensitive to outliers compared to the L1 loss. However, for larger errors, the function transitions to an L1 loss, ensuring robustness to extreme values. As a result, using Smooth L1 loss can lead to more stable training and better convergence.
Optimization was performed using the Adam optimizer^70^ for both protein systems. For Aβ42 training, the learning rate was manually reduced from 0.002 to 0.001 after 5000 iterations, resulting in a total of 10,000 training iterations. The learning rate was kept constant at 0.001 for aβ-crystallin57–69, comprising a total of 5000 training iterations. We fed 200 conformations in mini-batches per iteration to the model, randomly selecting frames from the data pool during each training iteration. Training was terminated upon convergence of both the training and validation loss, a process that took a few minutes for both systems. The MDAnalysis Python library^71^ used as the tool of choice for reading and writing trajectories.
Implementation
In this study, we employed the PyTorch library to implement all deep learning algorithms^72^. MDAnalysis, a Python library, facilitated MD trajectory input/output operations and torsion list construction^71^. Leveraging the torch tensor data structure in PyTorch, we implemented GPU-optimized BAT and vBAT calculations. The code and pre-trained ICoN model parameters can be accessed at the following link: https://github.com/chang-group/ICoN
Evaluation of reconstructed conformations
The accuracy of the ICoN model was assessed using the validation dataset. Initially, vBAT features extracted from the validation set were fed into the pretrained IcoN model, followed by a forward pass. Subsequently, the model’s output was converted into fully atomistic Cartesian coordinates. Reconstruction accuracy was measured by computing the RMSD of all heavy atoms (including backbone atoms) between the original and reconstructed conformations. Heavy atom RMSD calculation involved all non-hydrogen atoms, while backbone RMSD estimation focused on N, Cα, C, and O atoms (refer to Table 1).
Furthermore, we conducted more comprehensive comparisons between the original and reconstructed conformations, examining factors such as the distribution of all dihedral angles (including both backbone and sidechain for all amino acids) and their correlations. To address periodicity issues in angular space, dihedral angle correlations were computed using a trigonometric function (see details in the Supplementary Information)^59,68,73^. These analyses were performed to verify that all properties of the original conformations are faithfully reproduced (see Figure S3–S6).
Generation of Novel Conformations
Step1: Non-linear Interpolation
In order to generate synthetic conformations, we first encode training and validation data set into the latent space. Then, non-linear interpolation is performed between each pair of consecutive points in 3D latent space. Interpolation involves generation of path that moves between selected pair of points in an arc gradually changing polar angle Θ, azimuthal angle Φ, and radius R in a spherical coordinate system. The center of the sphere is located at the midpoint between two points, and random adjustments to the center position were made to generate a distinct path (see details in the Supplementary Information). 10 points were interpolated from each consecutive pair of points yielding a total of 19,9990 and 99,990 interpolated conformations for Aβ42 MD-Run1 and aβ-crystallin57–69-Seed1, respectively (Table 1).
Step 2: Eliminating non-physical conformations.
To avoid steric clashes or non-physical conformations, all generated structures were subjected to a short minimization using a modified Generalized Born with implicit solvent model (mbondi3, igb=8) for 1000 steps. We applied energy cutoff of −400 (kcal/mol) and −200 (kcal/mol) for Aβ42 and aβ-crystallin57–69 respectively. Energy cutoff values are selected from more positive low population regions of total energy distribution. All conformations with energy higher than the cutoff were excluded (Table 1). Energy minimization did not affect the accuracy of prediction but is essential to correct for the physically unrealistic conformations. To have a fair comparison between various energy terms, the original MD trajectories were also minimized using a similar strategy involving 50 steps.
Step3: Elimination of repeated conformations with pairwise RMSD search.
After energy-based elimination, we further perform pairwise heavy atom RMSD search with 1Å cutoff across all remaining conformations in order to eliminate repeated conformations. 1 Å RMSD difference can capture small variations in sidechain as well as salt bridge formation or breaking. This step is essential to ensure all conformations are unique and diverse. Table 1 shows the number of remaining conformations for all MD runs.
Step4: Elimination of existing conformations in raw MD.
We further compare remaining synthetic conformations from previous step with all the conformations in raw MD using pairwise heavy atom RMSD search. Conformations with the pairwise RMSD lower than 2 Å cutoff were deemed similar and excluded from synthetic pool for both Aβ42 and aβ-crystallin57–69 (Table 1). All pairwise RMSD searches were performed using PTRAJ^74^.
Dihedral PCA (dPCA)
To extract conformations with major protein motions, we utilized dihedral Principal Component Analysis (dPCA) employing torsion angles with in house code. The resulting conformations, derived from the first three principal components, were then projected onto a latent space using the IcoN model in order to get insight on conformational transition of the aβ-crystallin57–69 300-ps trajectory, composed of 300 frames. Notably, the transitions observed in the latent space, governed by principal component modes, manifested as smooth nonlinear trajectories. Consequently, we opted for nonlinear interpolation to produce synthetic conformations.
The dPCA calculation involves mapping torsion angles to a unit circle using trigonometric functions. This enables accurate estimation of differences (involving subtraction) and averages (involving summation) during the construction of the covariance matrix, thereby preventing erroneous computation of their correlation at the discontinuity margin (±180° or 360°/0°)^59,68,73^.
Analysis of synthetic conformations
Free energy profiles
To derive free energy profiles, we initially constructed a two-dimensional histogram by binning the radius of gyration (Rg) and backbone root mean squared deviation (RMSD) of protein conformations with respect to a reference structure. Subsequently, we determined the normalized population (P) for each bin throughout the entire trajectory. The free energy values were then computed using the equation F = −k_B_T log(P), where k_B_ represents the Boltzmann constant and T is a room temperature.
Contact maps
The contact maps were generated by applying a cutoff of 6.5 Å to all intramolecular distances between C-alpha positions. A contact density of one is assigned when a contact is formed, and zero otherwise, for each frame. Local contact densities of residue pairs with indexes i and j, where |i-j| < 4 were set to zero. The average across all frames (conformations) yielded the contact probability between the pairs of residues.
Aβ42 turn characterization
Four local turns have been identified as turn A (F4-H6), turn B (E11-H14), turn C (S26-K28), and turn D (V36-G38). To selectively isolate conformations characterized by turns occurring in specific regions, we initially identified residue pairs exhibiting a high pair contact density in proximity to the corresponding turn region, as depicted in contact map of Figure 7.
Subsequently, a turn was deemed to be formed if the C-alpha distance between the selected pair of residues was less than 6.5 Å. Conversely, a turn was classified as not formed when the contact distance exceeded 9.0 Å. For turns A, B, C, and D, the residue pairs utilized for contact identification are Ala2:Asp7, Tyr10:Gly15, Val24:Gln29, and Gly33:Val40, respectively are selected.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Miller M. D.; Phillips G. N. Moving beyond Static Snapshots: Protein Dynamics and the Protein Data Bank. Journal of Biological Chemistry 2021, 296, 100749. 10.1016/j.jbc.2021.100749.33961840 PMC 8164045 · doi ↗ · pubmed ↗
- 2Dill K. A.; Bromberg S. Molecular Driving Forces: Statistical Thermodynamics in Biology, Chemistry, Physics, and Nanoscience; Garland Science, 2011.
- 3Zhang Y.; Chen H.; Li R.; Sterling K.; Song W. Amyloid β-Based Therapy for Alzheimer’s Disease: Challenges, Successes and Future. Signal Transduct Target Ther 2023, 8 (1), 248. 10.1038/s 41392-023-01484-7.37386015 PMC 10310781 · doi ↗ · pubmed ↗
- 4Huang B.; Kong L.; Wang C.; Ju F.; Zhang Q.; Zhu J.; Gong T.; Zhang H.; Yu C.; Zheng W.-M.; Bu D. Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms. Genomics Proteomics Bioinformatics 2023, 21 (5), 913–925. 10.1016/j.gpb.2022.11.014.37001856 PMC 10928435 · doi ↗ · pubmed ↗
- 5Whitford D. Proteins: Structure and Function; Wiley, 2013.
- 6Tripathi T.; Dubey V. K. Advances in Protein Molecular and Structural Biology Methods; Elsevier Science, 2022.
- 7Macuglia D.; Roux B.; Ciccotti G. The Emergence of Protein Dynamics Simulations: How Computational Statistical Mechanics Met Biochemistry. The European Physical Journal H 2022, 47 (1), 13. 10.1140/epjh/s 13129-022-00043-y. · doi ↗
- 8Hollingsworth S. A.; Dror R. O. Molecular Dynamics Simulation for All. Neuron 2018, 99 (6), 1129–1143. 10.1016/j.neuron.2018.08.011.30236283 PMC 6209097 · doi ↗ · pubmed ↗
