Decrypting cryptic pockets with physics-based simulations and artificial intelligence

Si Zhang; Gregory R. Bowman

PMC · DOI:10.1016/j.sbi.2025.103215·March 4, 2026

Decrypting cryptic pockets with physics-based simulations and artificial intelligence

Si Zhang, Gregory R. Bowman

PDF

Open Access

TL;DR

This review discusses how physics-based simulations and AI help identify hidden protein pockets for drug discovery.

Contribution

The paper reviews hybrid computational strategies combining physics and AI for discovering cryptic pockets.

Findings

01

Physics-based simulations improve detection of transient protein pockets.

02

AI-driven models enhance the functional interpretation of cryptic pockets.

03

Hybrid methods offer better accuracy in identifying druggable sites.

Abstract

Cryptic pockets are promising targets for drug discovery that greatly expand the druggable proteome. In particular, they can provide opportunities to target proteins previously thought to be “undruggable” due to a lack of pockets in structures of the ground state. However, their transient and hidden nature renders them difficult to detect through conventional experimental screening methods. Recent advances in computational methodologies and resources have greatly enhanced our ability to identify and characterize such elusive pockets. This review highlights key developments in computational approaches, including physics-based molecular dynamics simulations, artificial intelligence–driven models, and hybrid strategies that integrate both to enhance cryptic pocket discovery and functional interpretation.

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

Dhfr

Proteins1

Chemicals7

AMG 510 VP35 thiol nopaline MRTX849 Water xenon

Mutations3

G12Dcysteine 12G12C

Figures3

Click any figure to enlarge with its caption.

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsProtein Degradation and Inhibitors · Computational Drug Discovery Methods · vaccines and immunoinformatics approaches

Full text

Introduction

Cryptic pockets are transient binding sites that arise from conformational rearrangements driven by a protein’s thermal fluctuations. Opening of these pockets is rare, so they are typically closed, and therefore hidden, in experimental structures. This situation is likely exacerbated by the fact that most protein structures are determined under cryogenic conditions (~100 K), which severely restrict conformational flexibility [1,2]. As a result, identifying cryptic pockets is challenging even though they are thought to be quite common [3,4]. Most known cryptic sites were discovered serendipitously by solving structures for hits from high-throughput screens that happened to find molecules that bind and stabilize an open cryptic pocket [5–8]. It would be far better to have a means to discover cryptic pockets without needing molecules that bind them. Then one could use this knowledge of a pocket’s structure/location to intentionally target the pocket.

The ability to identify and target cryptic pockets would provide a host of new opportunities for drug discovery. In particular, they could provide a means to target proteins historically deemed “undruggable” due to lack of apparent binding pockets. Targeting cryptic pockets could provide greater specificity than targeting functional sites (e.g. for kinases, where molecules that target the active site of one kinase are likely to inhibit other kinases). Many cryptic pockets are allosterically coupled to functional sites, opening up the possibility of enhancing desirable functions in addition to inhibiting undesirable functions [8–14].

Despite the challenges that cryptic pockets pose, recent successes highlight their therapeutic potential. For example, the discovery of a hidden switch-II pocket in KRAS G12C led to the development of two Food and Drug Administration—approved covalent inhibitors, AMG 510 and MRTX849 [15,16]. These compounds selectively bind to cysteine 12 within a cryptic site. Similarly, in CB1, the discovery of an extended cryptic pocket bridging the orthosteric site and the conserved signaling residue D^2.50^ enabled the rational design of a ligand, VIP36 [17].

Advances in computational methods are helping to make the discovery and targeting of cryptic pockets more routine. In particular, molecular dynamics (MD) simulations offer a powerful means to capture protein dynamics and reveal transient structural changes that create cryptic pockets [17,18]. Such motions, ranging from side chain and loop fluctuations to interdomain rearrangements, occur across diverse timescales as illustrated in Figure 1. Enhanced sampling techniques can further improve detection of rare, high-energy conformational states that are critical for the emergence of cryptic sites [19–21]. Meanwhile, advances in artificial intelligence (AI), such as AlphaFold (AlphaFold3) [22,23], have added new capabilities to structure prediction and cryptic pocket exploration [24,25].

Hybrid approaches combining physics-based simulations with machine learning models, such as PocketMiner [3] and BioEmu [26], have opened new avenues for structural and functional predictions.

Here, we review recent progress in uncovering cryptic pockets (Table 1, Figure 2) using molecular simulations and AI-driven models (individually and in combination) and explore how these tools are expanding our ability to identify cryptic site formation and their implications for drug discovery.

Exploring cryptic pockets through molecular simulations

MD simulations have become a powerful technique for investigating cryptic pockets in proteins [17,18,27–30]. In early studies, docking was directly performed on MD-generated snapshots containing transient pockets to assess ligand binding potential and pocket properties. However, even long-timescale MD simulation, spanning hundreds of microseconds, can fall short in capturing transient, high-energy pocket states in certain systems.

For instance, pockets that emerge from secondary structure motions or changes may occur on timescales ranging from microseconds to minutes, often exceeding the practical limits of conventional MD simulations.

To overcome these limitations and accelerate the exploration of relevant conformational space, enhanced sampling techniques, such as the fluctuation amplification of specific traits (FAST), have been introduced. FAST efficiently guides simulations toward conformations exhibiting desired structural features, like increased inter-residue distances or increased pocket volumes, by iteratively running simulations, building a map of the space explored so far, and using that map to decide where to gather more data [46].

Markov state models (MSMs) are commonly used to map the conformational landscape sampled by MD [31–36]. MSMs discretize this landscape into a finite set of states, from which representative conformations with well-defined cryptic pockets can be extracted. Ligand binding can then be evaluated using advanced methods such as Boltzmann docking [47] and PopShift [48], both of which account for the probabilities of different states and how strongly a ligand binds each state. This MSM-based framework has been proven to be effective for modeling the binding of known ligands that target alternative sites (e.g., orthosteric regions) and for virtually screening compound libraries against cryptic pockets. Notably, it has enabled the successful discovery of ligands for targets like TEM-1 β-lactamase and the 5-HT_3A_ receptor [14,36].

A particularly compelling example of this framework is the discovery of a cryptic pocket in the interferon inhibitory domain of Zaire Ebola VP35 (VP35) [34]. A combination of FAST and large-scale simulations on the Folding@home distributed computing platform [49,50] were used to explore the protein’s conformational space. To identify cryptic pocket within this large ensemble, exposon analysis was employed [51]. Exposons are groups of residues that undergo highly correlated changes in solvent exposure and are often associated with cryptic site formation. This analysis revealed a cryptic pocket that forms when a small helix (helix 5) separates from a 4-helix bundle (Figure 3). This pocket was found to be allosterically coupled to the double-stranded RNA (dsRNA) blunt-end-binding interface, which was captured in a separate exposon. The correlation of all rotameric and dynamical states algorithm [52] further validated this allosteric connection by quantifying the coupling between residue pairs through their dihedral angle dynamics. To gain mechanistic insight, the DiffNets machine learning algorithm [53], a supervised autoencoder architecture designed to identify key structural differences between ensembles (e.g. open vs. closed states), was applied. DiffNets revealed strong coupling between cryptic pocket opening/closing and the structural preferences of a key RNA-binding residue, F239. Subsequent experiments confirmed the existence of the pocket and its allosteric control over RNA binding. This study represents a rare example of successfully targeting a difficult, nonenzymatic protein involved in protein—nucleic acid interactions, demonstrating the therapeutic potential of leveraging cryptic pockets to modulate such challenging targets. A follow-up study demonstrated that the open state of the pocket has a biological function, which is an exciting insight [54].

Metadynamics is another promising approach for studying cryptic pockets. It is particularly effective at enhancing sampling along predefined conformational coordinates, known as collective variables (CVs), thereby facilitating access to otherwise rare structural states [41]. For example, Benabderrahmane et al. systematically identified cryptic pockets in the antiapoptotic protein, Mcl-1 by performing well-tempered metadynamics on essential coordinate space [42]. These coordinates, used as CVs, were derived from essential dynamics captured via principal component analysis of an initial 100-ns unbiased MD simulation, allowing enhanced sampling simulations to focus on the dominant protein motions.

More recently, Vithani et al. applied an advanced variant of weighted ensemble (WE) MD simulations to investigate cryptic pockets in both wild-type KRAS and its G12D mutant [19]. In their approach, normal mode analysis was used to derive inherent normal modes, which then served as progress coordinates to guide simulations toward relevant conformational transitions. Notably, their study also incorporated mixed-solvent MD (MSMD) simulations, in which small cosolvent molecules were introduced as probes to identify hydrophobic or transiently accessible cavities. MSMD not only enables the detection of cryptic sites but also provides quantitative characterization of pocket properties, such as estimated binding free energies through cosolvent occupancy analysis [37–40,55]. Its growing popularity stems from its ability to mimic ligand—induced-fit effects and uncover dynamic binding hotspots that may take longer to open in aqueous-only simulations. Despite its strengths, MSMD alone has limitations, including challenges in achieving sufficient sampling, risks of protein destabilization, and the need for careful selection of probe molecules [56]. To overcome these issues and harness the strengths of MSMD, it is increasingly being integrated with other simulation techniques. In this study, Vithani et al. combined MSMD with WE simulations using xenon as the chemical probe [19], allowing simultaneous capture of induced-fit dynamics and quantification of cosolvent occupancy and residence times. Leveraging over 400 μs of simulation data, the authors conducted comprehensive analyses of cryptic pockets in both KRAS and its G12D mutant. The analyses included probe occupancy mapping, exposon analysis, and a modified version of exposon termed as dynamic probe binding analysis, which calculates the correlated changes in xenon binding. Together, these approaches provided mechanistic insights into pocket flexibility and revealed key allosteric networks with respect to the cryptic pockets in KRAS.

Another promising method is Sampling Water Interfaces through Scaled Hamiltonians (SWISH), a Hamiltonian Replica Exchange (HREX)-based technique devised by Oleinikovas et al. [21]. In combination with small organic probes, SWISH progressively scales the nonbonded interactions between solvent molecules and apolar protein atoms, effectively shifting the water properties toward more ligand-like behavior and thereby facilitating the opening of cryptic sites. This approach has been shown to explore conformational changes with high activation barriers and successfully induce the formation of known cryptic binding sites in targets TEM-1 β-lactamase, interleukin-2, and Polo-like kinase-1. Compared to conventional long-timescale MD or parallel tempering simulations, SWISH offers improved accuracy and sampling efficiency for uncovering cryptic conformations. Building on this, the same group recently developed SWISH-X, an enhanced version of SWISH that combines OPES MultiThermal for faster and more accurate cryptic pocket exploration across diverse systems [20].

While these approaches show strong potential for cryptic pocket detection, their application in drug discovery to unexplored proteins still requires extensive downstream analysis. This includes quantitative characterization of pocket properties, such as assessing pocket druggability and functional relevance [9,57–62], as well as experimental tests using methods like thiol labeling or fragment-based screening [32,34,63,64].

Identifying cryptic pockets using AI-driven models

With the rapid advancement of AI, a new generation of AI-driven models has emerged to tackle the challenges of cryptic pocket detection. These models can either predict the likelihood of individual residues participating in cryptic pockets or generate open pocket conformations far more efficiently than conventional MD simulations.

Structure prediction tools like AlphaFold have demonstrated remarkable accuracy in modeling protein structures [22]. Through stochastic sampling of input multiple sequence alignments, AlphaFold can produce structural ensembles that may exhibit open or partially open cryptic pockets. However, despite its success, AlphaFold was primarily trained on experimentally determined structures from the Protein Data Bank (PDB) and large sequence databases. It was not explicitly designed to sample alternative conformations, which limits its ability to directly predict cryptic pockets. For instance, a study by Meller et al. showed that AlphaFold could recapitulate cryptic pockets in only 6 out of 10 proteins [24]. The predicted structural ensembles also lack information about the relative weights of different conformations, making it unclear how likely a pocket-opening event is. Additionally, some AlphaFold-predicted structures contain highly flexible domains with low prediction confidence, further limiting applicability. Nevertheless, these predicted structures can serve as valuable starting points for further exploration using unbiased or biased MD simulations [24,25]. In Meller et al.’s study, subsequent MD simulations followed by MSMs analysis provided deeper insights into pocket dynamics that were not accessible from AlphaFold alone. Starting simulations from these predicted structures also accelerated pocket discovery compared to starting from crystal structures. With the recent release of AlphaFold3, which supports joint structure prediction of proteins and small molecules, there is potential for improved identification of open pocket conformations and ligand-induced structural changes [23].

In parallel, sequence-based AI models have recently emerged as powerful tools for protein structure and function prediction, including the assessment of cryptic pocket propensity at the residue level without requiring structural input. Škrhák et al. utilized three different protein language models, ProtT5-XL-U50, ESM-1b, and ProtBert-BFD, to generate residue-level embeddings, which were then fed into a neural network for prediction [43]. These embeddings capture rich contextual information from protein sequences, enabling accurate cryptic site predictions directly from sequence data. Model performance was evaluated using the CryptoSite dataset, which comprises 93 apo—holo protein structure pairs containing validated cryptic pockets [4]. Notably, predictions based on ProtT5-XL-U50 and ESM-1b embeddings outperformed ProtBert-BFD and slightly surpassed CryptoSite, a structure-based model, in terms of the area under the curve (AUC) on the test set. To support the development and evaluation of cryptic pocket predictors, the same group introduced CryptoBench, a larger and more comprehensive benchmark set consisting of 1107 apo—holo protein pairs curated using pocket RMSD (root mean square deviation) as the selection criterion [44]. Using this dataset, they trained a new neural network model using embeddings from the ESM2–3B model. Although this model achieved higher AUC scores on the test set compared to PocketMiner, a structure-based predictor, it is important to note that the two models were trained on different datasets.

Advancing the field further, Martinez et al. constructed the largest known database of cryptic sites to date, comprising over 5.5 million structural alignments of apo and holo protein pairs from the PDB [45]. Using a curated dataset of 71 cryptic and 128 non-cryptic examples, they trained a supervised machine learning model to detect ligand-induced conformational changes and score cryptic pocket formation, ultimately identifying approximately 2,00,000 apo—holo combinations containing potential cryptic sites. Building on this dataset, the authors then fine-tuned a protein language model (Prot-T5-XL-UniRef50) to predict cryptic pocket locations directly from the sequence. Although this model demonstrated high prediction performance when query sequences share over 20 % sequence identity with CryptoBank entries, its generalizability to novel sequences remained limited. Additionally, analyses of ligand molecular weight and relative solvent accessible surface area (RSA) revealed that many cryptic pockets tend to accommodate larger ligands (molecular weight >300 Da) and are located in deeper, less solvent-exposed regions (RSA <0.3) rather than on the surface (RSA >0.3). Based on these insights, they curated a refined fragment library of ~ 6000 clustered ligands designed specifically to screen for cryptic binding pockets.

Together, these AI-driven approaches complement physics-based simulations by enabling rapid screening and prioritization of potential cryptic sites, especially valuable for targets with limited structural information or where experimental data are sparse.

Integrating AI with simulation data to predict cryptic pockets

In addition to sequence-based models, structure-based machine learning models represent another powerful class of tools for cryptic pocket predictions, leveraging protein structural ensembles sampled from MD simulations. A notable early example is CryptoSite, which requires an input protein structure and was trained to identify residues that transition from an orientation incompatible with ligand binding to one that accommodates a ligand [4]. Its training was based on a set of 84 confirmed cryptic pockets derived from the PDB. While CryptoSite achieves good accuracy in classifying pocket-forming residues, its application is computationally expensive―it requires generating simulation data on-the-fly as one of the input features, taking approximately one day per input structure.

A more recent and efficient alternative is PocketMiner, which predicts whether each residue in a given protein structure will participate in the formation of a cryptic pocket during a short MD simulation initiated from that structure [3]. The prediction is typically generated within seconds. PocketMiner is built on a geometric vector perceptron (GVP)-based graph neural network, designed to learn residue-level representations from diverse protein conformations sampled through simulations. For each input structure, the model extracts structural features, such as dihedral angles and inter-residue directions/distances, processes them through GVP layers, and updates the residue embeddings via message passing layers. The final residue-level predictions are obtained using a sigmoid activation function. The model was trained on a curated dataset of 38 proteins containing 39 experimentally confirmed cryptic pockets. A large number of simulations were performed on these proteins, capturing thousands of cryptic pocket-opening events. From these simulations, residue-level training labels were generated by measuring changes in LIGSITE pocket volume and the maximum fpocket druggability score in the vicinity of each residue. The final dataset provided sufficient structural diversity and reliable dynamic labels to support effective model training. Compared to CryptoSite, PocketMiner achieves slightly higher accuracy, as measured by the area under the receiver operating characteristic curve (ROC-AUC: 0.87 vs. 0.85), while offering over 1000-fold faster prediction speed.

One other promising model is BioEmu, which integrates over 200 ms of simulation data, static structures, and experimental protein stabilities through a unique training paradigm [26]. Notably, a significant portion of its training data comes from the large-scale distributed computing resource Folding@home [33]. BioEmu predicts diverse functional motions, including cryptic pocket formation, by generating structure ensembles. In a benchmark of 34 experimentally validated cryptic pocket cases, the model successfully recovered 86 % of holo structures. However, its performance on apo conformations was lower, with only 56 % accurately predicted, highlighting the need for further improvements, especially in modeling unbound structures or better balancing apo and holo representations during training.

Conclusions

Cryptic pockets present exciting opportunities for drug discovery, for example, they can enable the targeting of proteins that were previously considered “undruggable”. However, their transient and dynamic nature makes them inherently difficult to detect using static structure-based experimental screening. Physics-based MD simulations, particularly when combining with advanced techniques such as FAST, MSMs, mixed-solvent approaches, and enhanced sampling have proven effective in capturing the dynamic formation and properties of these hidden pockets. Meanwhile, AI-driven models offer rapid and scalable alternatives capable of predicting cryptic pocket locations and dynamics with growing accuracy and interpretability. Despite these advances, key challenges remain: MD simulations are computationally expensive and often system-specific, while AI models may yield unphysical results or struggle to generalize across diverse proteins. A major limitation is the lack of large-scale, experimentally tested datasets for benchmarking. Future efforts should focus on building high-quality training data, integrating AI with physics-based methods, improving model interpretability, and continuous pairing predictions with experimental validation to accelerate cryptic pocket discovery and drug design.

Bibliography64

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Fraser JS, van den Bedem H, Samelson AJ, Lang PT, Holton JM, Echols N, Alber T: Accessing protein conformational ensembles using room-temperature X-ray crystallography. Proc Natl Acad Sci U S A 2011, 108:16247–16252.21918110 10.1073/pnas.1111325108 PMC 3182744 · doi ↗ · pubmed ↗
2Garman E: ‘Cool’ crystals: macromolecular cryocrystallography and radiation damage. Curr Opin Struct Biol 2003, 13: 545–551.14568608 10.1016/j.sbi.2003.09.013 · doi ↗ · pubmed ↗
3Meller A, Ward M, Borowsky J, Kshirsagar M, Lotthammer JM, Oviedo F, Ferres JL, Bowman GR: Predicting locations of cryptic pockets from single protein structures using the Pocket Miner graph neural network. Nat Commun 2023, 14: 1177.36859488 10.1038/s 41467-023-36699-3PMC 9977097 · doi ↗ · pubmed ↗
4Cimermancic P, Weinkam P, Rettenmaier TJ, Bichmann L, Keedy DA, Woldeyes RA, Schneidman-Duhovny D, Demerdash ON, Mitchell JC, Wells JA, : Crypto Site: expanding the druggable proteome by characterization and prediction of cryptic binding sites. J Mol Biol 2016, 428: 709–719.26854760 10.1016/j.jmb.2016.01.029PMC 4794384 · doi ↗ · pubmed ↗
5Horn JR, Shoichet BK: Allosteric inhibition through core disruption. J Mol Biol 2004, 336:1283–1291.15037085 10.1016/j.jmb.2003.12.068 · doi ↗ · pubmed ↗
6Günther S, Reinke PYA, Fernández-García Y, Lieske J, Lane TJ, Ginn HM, Koua FHM, Ehrt C, Ewert W, Oberthuer D, : X-ray screening identifies active site and allosteric inhibitors of SARS-Co V-2 main protease. Science (1979) 2021, 372: 642–646.10.1126/science.abf 7945 PMC 822438533811162 · doi ↗ · pubmed ↗
7Knoverek CR, Amarasinghe GK, Bowman GR: Advanced methods for accessing protein shape-shifting present new therapeutic opportunities. Trends Biochem Sci 2019, 44: 351–364.30555007 10.1016/j.tibs.2018.11.007PMC 6422738 · doi ↗ · pubmed ↗
8Ostrem JM, Peters U, Sos ML, Wells JA, Shokat KM: K Ras(G 12C) inhibitors allosterically control GTP affinity and effector interactions. Nature 2013, 503:548–551.24256730 10.1038/nature 12796 PMC 4274051 · doi ↗ · pubmed ↗