KRASAVA—An Expert System for Virtual Screening of KRAS G12D Inhibitors
Oleg V. Tinkov, Pavel E. Gurevich, Sergei A. Nikolenko, Shamil D. Kadyrov, Natalya S. Bogatyreva, Veniamin Y. Grigorev, Dmitry N. Ivankov, Marina A. Pak

TL;DR
KRASAVA is a new system that helps identify promising KRAS G12D inhibitors using machine learning and virtual screening.
Contribution
KRASAVA introduces a QSAR-based framework with virtual screening and applicability domain assessment for KRAS G12D inhibitors.
Findings
The consensus QSAR model achieved a Q2 test value of 0.70 and covered 78% of data within the applicability domain.
Molecular docking of proposed inhibitors showed strong consistency with QSAR results and key interactions with key residues.
The system includes bioavailability rules and toxicophore screening to prioritize viable drug candidates.
Abstract
The development of KRAS G12D inhibitors represents an effective therapeutic strategy for treating oncological pathologies. Existing quantitative structure-activity relationship (QSAR) models for KRAS G12D inhibitors have several limitations, primarily the lack of applicability domain determination and virtual screening implementation. In this study, we propose a set of regression QSAR models for KRAS G12D inhibitors by employing various molecular descriptors and machine learning methods. Our consensus model achieved a Q2 test value of 0.70 on an external test set, covering 78% of the data within the applicability domain. We integrated this consensus model into our Python-based framework KRASAVA. The platform predicts inhibitory activity while considering the applicability domain, assesses compounds for compliance with Muegge’s bioavailability rules, and identifies PAINS, toxicophores,…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12
Figure 13
Figure 14
Figure 15
Figure 16
Figure 17
Figure 18
Figure 19
Figure 20
Figure 21
Figure 22
Figure 23
Figure 24
Figure 25
Figure 26
Figure 27
Figure 28
Figure 29
Figure 30
Figure 31
Figure 32
Figure 33
Figure 34
Figure 35
Figure 36
Figure 37
Figure 38
Figure 39
Figure 40
Figure 41
Figure 42
Figure 43
Figure 44
Figure 45
Figure 46
Figure 47
Figure 48
Figure 49- —budget of the Institute of Physiologically Active Compounds of the Russian Academy of Sciences (IPAC RAS), State Targets—2024
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsProtein Kinase Regulation and GTPase Signaling · Receptor Mechanisms and Signaling · Computational Drug Discovery Methods
1. Introduction
The KRAS gene (Kirsten rat sarcoma viral oncogene) encodes a protein from the Ras GTPase family playing a central role in cellular signaling by regulating proliferation, differentiation, and cell survival [1,2]. Normally, the KRAS protein cycles between an active (GTP-bound) and inactive (GDP-bound) state. The G12D mutation, involving the mutation of glycine with aspartic acid at position 12, is among the most common and aggressive KRAS mutations [3,4,5,6]. The G12D mutation constitutively activates KRAS, directly causing uncontrolled cell proliferation, tumor formation, and alterations in the tumor microenvironment. This direct causal link makes KRAS G12D a critically important oncological target, whose inhibition can fundamentally alter disease progression [7]. The G12D mutation locks KRAS in the active state, leading to continuous stimulation of downstream signaling pathways such as MAPK/ERK and PI3K/AKT. Persistent activation of these pathways drives uncontrolled cell growth, tumor formation, and modification of the tumor microenvironment [8]. The G12D mutation is the most common subtype of KRAS mutations in pancreatic cancer and a frequently occurring dominant subtype in colorectal cancer [9]. Consequently, G12D is not merely a marker but an active driver of cancer pathogenesis. Targeting this mutation addresses the core mechanism of oncogenesis, explaining its importance in drug development [10]. Recent progress in the development of KRAS G12D inhibitors, including MRTX1133 and ASP3082, demonstrate the clinical relevance of targeting the G12D mutation [11,12].
In modern pharmacology, drug discovery remains a complex and resource-intensive process. Quantitative structure–activity relationships (QSAR) serve as a fundamental statistical tool to correlate the chemical structure of molecules with their biological activity through molecular descriptors [13]. QSAR offers significant value by predicting the biological activity of novel compounds, thereby reducing costly and labor-intensive experimental assays and accelerating the drug discovery timeline. QSAR models play a pivotal role in the early stages of drug development, enabling the efficient elimination of compounds with undesirable properties prior to extensive laboratory testing [14]. A critical aspect of drug development is the prediction and understanding of toxicity and bioavailability profiles, as toxicity causes preclinical failure for approximately 30% of compounds [15], and inadequate pharmacokinetics eliminates up to 15% of candidates at the preclinical stage [16].
For example, MRTX1133 is a highly selective, non-covalent inhibitor of the mutant KRAS G12D protein. Preclinical studies demonstrated its exceptional potency and specificity (IC_50_ < 2 nM, ~700–1000-fold selectivity over KRAS WT) and induced significant tumor regression in xenograft and immunocompetent mouse models [17,18]. Despite the promise of MRTX1133 as a KRAS G12D inhibitor, preclinical pharmacokinetic studies revealed challenges for its therapeutic development. A 2024 study [19] in rats reported very low oral bioavailability of MRTX1133 at only 2.92% and a short plasma half-life of 1.12 h after oral administration. These data indicated potential difficulties in achieving and maintaining adequate therapeutic concentrations in humans, a critical factor for successful clinical translation. Clinical trials of MRTX1133 (phase 1/2), initiated in March 2023, were prematurely terminated in early 2025 after phase 1 completion. The reasons included unstable and inadequate pharmacokinetics alongside highly variable and unsatisfactory bioavailability data [20].
This example highlights the necessity of thoroughly evaluating both inhibitory activity and ADMET properties for KRAS G12D inhibitors. Traditional oral bioavailability evaluation relies on rules introduced by Lipinski [21], Muegge [22], Ghose [23], Veber [24], and Egan [25]. Among these, Muegge’s rules incorporate the largest set of parameters for bioavailability assessment, making them especially useful at early development stages. Muegge’s rules define an expanded drug-likeness filter that sets explicit thresholds—molecular weight between 200 and 600 Da, logP ≤ 5, ≤10 hydrogen-bond acceptors, ≤5 hydrogen-bond donors, ≤15 rotatable bonds, TPSA ≤ 150 Å^2^, and ≤7 rings—to rigorously prioritize bioavailability compounds during early-stage virtual screening. Notably, MRTX1133 does not comply with Muegge’s criteria, showing a molecular weight exceeding 600 Da and possessing more than seven ring systems. Various medicinal chemistry filters are applied early in drug development to assess compound compliance with established bioavailability rules and to exclude structural alerts such as Brenk filters and PAINS. Brenk filters consist of 105 structural fragments that increase toxicity risk, impair pharmacokinetic properties, and generally reduce suitability for drug candidates [26]. Pan-assay interference compounds (PAINS) represent chemical entities that frequently produce false-positive results in high-throughput screening. PAINS exhibit nonspecific interactions across multiple biological targets rather than selectively acting on the intended target [27].
Given the relevance of KRAS inhibition as a therapeutic strategy, several research groups have developed satisfactory QSAR models linking chemical structure to inhibitory activity against KRAS [28,29,30,31]. Srisongkram et al. constructed robust QSAR models using a dataset of 1033 compounds, achieving = 0.60 and = 0.62 [28]. Srisongkram and Weerapreeyakul [29] proposed both classification and regression QSAR models targeting drug repurposing of FDA-approved compounds as KRAS G12C inhibitors, using a dataset of 1255 molecules with reported metrics of = 0.60, = 0.62, Accuracy cv = 0.84, Accuracy ext = 0.85. Both studies [28,29] employed compounds with experimentally measured IC_50_ values against KRAS G12C and applied the XGBoost gradient boosting algorithm together with molecular descriptors, including PubChem and substructural fingerprints. The authors utilized SHAP-based interpretation to identify key fragment contributions. The QSAR results from both studies received further validation through molecular docking.
Studies [30,31] proposed classification-based QSAR models for KRAS G12D inhibitors. Despite acceptable statistical performance, these models present a major limitation: they lack an explicitly defined applicability domain, violating a key QSAR modeling principle established by the OECD expert group [32]. Without this domain, QSAR models cannot meet regulatory suitability criteria. Additionally, these studies [30,31] do not clearly detail data curation and preprocessing procedures, a critical step given the risk of errors and inconsistencies in publicly available chemical databases and other sources [33]. To identify the most active compounds during virtual screening, it is advisable to employ regression models that predict the inhibitory activity levels in greater detail than classification models.
Importantly, the effective application of QSAR models for KRAS inhibitors in medicinal chemistry requires appropriate software tools, either desktop or web-based. Unfortunately, the aforementioned studies [28,29,30,31] did not provide such tools.
In this study, we aimed to construct regression QSAR models for KRAS G12D inhibitors using various molecular descriptors and machine learning methods and to integrate the developed QSAR model into our web-based Python v3.11 framework, KRASAVA (KRAS Automated Virtual Assistant), enabling virtual screening of KRAS G12D inhibitors with simultaneous bioavailability assessment based on Muegge’s rules.
The main stages of this study are presented in Figure 1: (1) collection of experimental data of KRAS G12D inhibitors; (2) data validation taking into account generally accepted recommendations [33]; (3) division of the total dataset into training and testing sets; (4) exploratory data analysis; (5) calculation of molecular descriptors; (6) development, validation and structural interpretation of QSAR regression models; (7) rational molecular design; (8) development of the KRASAVA framework as a reproducible Jupyter Notebook v7.5., executable via Google Colab with subsequent implementation of the best QSAR models; (9) molecular docking of the promising compounds under study. A detailed description of each stage is provided in the Methods and Materials section.
2. Results and Discussion
We developed several QSAR models for KRAS G12D inhibitors as described in Materials and Methods. Table 1 contains the statistical parameters of the QSAR models. All developed models of KRAS G12D inhibitors exhibit satisfactory statistical characteristics and demonstrate comparable predictive performance. The models developed using ECFP4, Topological Path-Based fingerprints, and 2D RDKit descriptors and the CatBoost algorithm demonstrated the best statistical performances (Table 1, the QSAR models highlighted in bold). The p-values for these models under y-randomization are below 0.02, confirming the absence of random correlations in the proposed QSAR models. The consensus model was developed by integrating the above three best models. The applicability domain in consensus forecasting is calculated using ECFP4 fingerprints, since they have the smallest data coverage compared to other types of descriptors used in the consensus model.
The predictive ability of the consensus QSAR model was further validated using a second independent external test set. Due to the relatively small structural space described by the proposed QSAR model, only 21 compounds were included in the applicability domain, with a Q^2^ts of 0.73, demonstrating a sufficiently high level of predictive ability for the proposed consensus QSAR model.
Additionally, we developed 63 QSAR models using the OCHEM platform; see their statistical parameters in Appendix A, Table A1. We obtained the best results by the Random Forest method using, again, ECFP4 descriptors: = 0.66; = 0.69. The model is publicly available at https://ochem.eu/model/20748154 (accessed on 10 December 2025).
To investigate the influence of structural features on the inhibitory activity of KRAS inhibitors, we performed a structural interpretation of the QSAR models developed using Klekota-Roth and PubChem descriptors (Figure 2 and Table A2 and Table A3). In addition, we carried out a matched molecular pair analysis (MMPA) for the best QSAR model developed using the OCHEM platform (Table 2).
Summarizing the obtained results, we can identify the following main structural modifications that increase the inhibitory activity of KRAS G12D inhibitors:
- 1.Substitution of a nitrile group with a hydroxy or alkyne group (molecular transformations 1 and 5 in Table 2), which is consistent with the high contribution to activity of descriptors No. 4 and 9 shown in Figure 2A and Table A2, as well as descriptors No. 4, 6, and 9 shown in Figure 2B and Table A3;
- 2.Replacement of a methoxy group with an alkyne group (molecular transformation 2 in Table 2);
- 3.Replacement of a pyridine fragment with a 1-methylpyrrolidine fragment (molecular transformation 3 in Table 2);
- 4.Substitution of a methoxy group with a hydroxy group (molecular transformation 4 in Table 2);
- 5.Replacement of an ethylene fragment with a pyrrolizidine fragment (molecular transformation 6 in Table 2);
- 6.Introduction of a hydroxyl group in the para-position (molecular transformation 7 in Table 2), which correlates with the high contribution to activity of descriptors No. 4 and 9 (Figure 2A, Table A2) and descriptors No. 4, 6, and 9 (Figure 2B, Table A3);
- 7.Elongation of the linker connected to the imidazole ring (molecular transformation 8 in Table 2);
- 8.Transformation of a phenyl group into a naphthyl group, consistent with the high contribution to activity of descriptors No. 2, 3, and 5 (Figure 2A, Table A2) (molecular transformation 9 in Table 2);
- 9.Replacement of a pyridine fragment with an imidazole ring (molecular transformation 10 in Table 2).
We integrated the consensus QSAR model into the KRASAVA framework, freely available at https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2/blob/main/KRASAVA%20v2.ipynb (accessed on 10 December 2025), a reproducible Jupyter Notebook, executable via Google Colab. In KRASAVA, one can enter the information on the chemical structures of compounds under investigation via the SMILES linear notations [34], or files in CSV or SDF format. One of the processing steps in KRASAVA is the automatic validation and standardization of the input chemical structures. If a user enters an invalid structure, the application returns the index number of the compound in the uploaded CSV or SDF file, along with the SMILES notation of the corresponding structure.
For chemical structures that successfully pass validation and standardization, the application checks for the availability of experimental IC_50_ values for KRAS G12D in the ChEMBL database. If experimental data are available, the framework displays the mean value, standard deviation of the experimental IC_50_ values, and the corresponding ChEMBL compound ID [35]. In this case, activity prediction is not performed.
When analyzing individual compounds, the KRASAVA framework implements the assessment of compliance with Muegge’s rules. Previously, we proposed a set of structural fragments—toxicophores—that consistently increase the level of acute oral toxicity in rats [36]. We integrated the identification of these fragments, along with Brenk filters and PAINS, into the KRASAVA framework (Figure 3), allowing preliminary evaluation of oral bioavailability and potential toxicity.
In addition, the framework includes identification of molecular fragments associated with the most significant molecular descriptors (Figure 2) that were determined through structural interpretation for the purposes of molecular design.
Based on the identified structure-activity relationships and using the capabilities of the KRASAVA framework, we performed a rational molecular design using the compound 4-(3,8-diazabicyclo[3.2.1]octan-3-yl)-8-fluoro-2-[[(2R,8S)-2-fluoro-1,2,3,5,6,7-hexahydropyrrolizin-8-yl]methoxy]-7-[5-methoxy-2-(trifluoromethoxy)phenyl]pyrido [4,3-d]pyrimidine (PubChem CID 156124915, SCHEMBL23053462, BDBM573509, Example 361), having the pIC_50_ value of 5.57, according to the patent US-11453683-B1 [37].
If the structure of compound BDBM573509 is modified by replacing the methoxy group with a hydroxy group (pattern No. 4) and the trifluoromethoxy group (a methoxy derivative) with an alkyne group (pattern No. 2), the inhibitory activity increases—the calculated pIC_50_ value for the new compound (compound 1) is 7.98 (Figure 4). In the modified compound 1, unlike the original molecule BDBM573509, pattern No. 6 is also satisfied, according to which a hydroxy group in the para-position relative to other substituents enhances the inhibitory activity against KRAS G12D.
Further modification, specifically the elimination of the fluorine atom initially located as a substituent in the pyrrolizidine ring, also increases the inhibitory activity—for compound 2, the predicted pIC_50_ value is 8.05 (Figure 4). According to the ChEMBL database, the experimental pIC_50_ value for the known KRAS G12D inhibitor MRTX1133 (PubChem CID 162369732, CHEMBL5081048, PDB ID 6IC) is 8.25 ± 0.47.
In addition, to optimize bioavailability, we performed modifications of the training set compounds and obtained the proposed compound 3 (Figure 4) with a predicted pIC_50_ value of 7.49. Using the KRASAVA framework, we predicted the inhibitory activities of compounds 1–3 by considering their inclusion within the applicability domain of the consensus QSAR model.
For comparative analysis of predictive performance, we also calculated the inhibitory activities of compounds 1–3 using the best QSAR model developed on the OCHEM platform (https://ochem.eu/model/20748154) (accessed on 10 December 2025). For compound 1 and compound 2, this model predicts a pIC_50_ value of 8.80, while for compound 3, the predicted value is 8.1.
Compared to MRTX1133, the investigated compounds 2 and 3 (Figure 5 and Table 3) exhibit a better combination of physicochemical properties, suggesting a potentially acceptable level of bioavailability.
It should be noted that we did not find the structures and experimental IC_50_ values for the investigated compounds 1–3 in the SureChEMBL (https://www.surechembl.org/) (accessed on 10 December 2025), PubChem (https://pubchem.ncbi.nlm.nih.gov/) (accessed on 10 December 2025), or BindingDB (https://www.bindingdb.org/rwd/bind/index.jsp) (accessed on 10 December 2025) databases. However, further studies are required to assess the patent landscape for compounds 1–3 through a detailed analysis of Markush structures in existing patents. Moreover, compounds 1 and 3 are considered solely as examples of rational molecular design based on the structural interpretation of QSAR models, aimed at the balanced search for active KRAS G12D inhibitors with acceptable bioavailability.
For comparative analysis of the QSAR modeling results, we additionally performed molecular docking of compound BDBM573509, compounds 1–3, using MRTX1133 as a reference compound. It binds to a specific pocket (S-IIP) on the surface of mutant KRAS G12D, which forms only in the presence of the mutation and in the inactive (GDP-bound) state of the protein. According to [38], MRTX1133 forms key interactions with Asp69, Asp12, Gly60, Glu62, His95, Arg68, as well as weaker interactions with Gln99, Glu63, and Ser65 (Figure 6).
Table 4 presents the results of molecular docking for compound BDBM573509, compounds 1–3, and MRTX1133. According to the data in Table 4, there is a weak correlation between the pIC_50_ values and the docking scores. Previously, it was shown that molecular docking was explored in parallel with QSAR modeling, but molecular docking failed to correctly discriminate between experimentally active and inactive compounds [39]. In this regard, the priority task in this study was not to calculate the values of docking scores, but primarily to analyze the interactions of the studied compounds with key amino acid residues in the active center of the KRAS G12D enzyme.
Figure 6, Figure 7, Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5 show the interactions of compound BDBM573509, compounds 1–3, and MRTX1133 with key amino acid residues in the KRAS G12D active site. One can see the following key interactions from Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5:
- Ionic interactions with Asp12—compound MRTX1133;
- Gly60 hydrogen bond—compound MRTX1133, compound 1 and compound 2;
- His95 hydrogen bond—compounds MRTX1133, BDBM573509, compound 1 and compound 2;
- Arg68 hydrogen bond—compounds MRTX1133, BDBM573509, compound 1 and compound 3;
- Asp69 hydrogen bond—compound MRTX1133 and compound 3.
The 2D visualization of the interactions of compound BDBM573509, compounds 1–3, and MRTX1133 with key amino acid residues is provided in Appendix A (Figure A1, Figure A2, Figure A3, Figure A4 and Figure A5). The identified interactions are supported by previous studies on similar compounds [38].
Analyzing the data in Table 2 and Figure 2 and Figure 6, we noted a consistent correlation between the structural interpretation of the QSAR models and molecular docking results. According to the QSAR model interpretation, descriptors describing the hydroxy substituent (descriptors No. 2 and 4 in Figure 2A, as well as descriptor No. 5 in Figure 2B) contribute significantly to activity, which is confirmed by the formation of hydrogen bonds with Asp69 in MRTX1133.
The present study of inhibitors has several distinctive features compared to existing studies also devoted to QSAR analysis of KRAS inhibitors. The main advantage of the present study compared with previous works by Srisongkram et al. [28,29] is twofold. First, we have developed and validated QSAR models for KRAS G12D inhibitors, which are of greater medicinal chemistry interest than the KRAS G12C inhibitors investigated by Srisongkram et al. The development of KRAS G12D inhibitors is particularly promising for the treatment of pancreatic and colorectal cancers, where the clinical need is high. Second, compared with the work of Srisongkram et al. [28,29], we propose a freely available framework that significantly automates the virtual screening of KRAS G12D inhibitors. A comparative analysis of this study with existing ones is presented in Table 5, which demonstrates that our research has several advantages. For example, in this study, when developing QSAR regression models, we defined the applicability range, performed a structural interpretation, and, most importantly, integrated the proposed consensus models into the KRASAVA framework, which enables intensified virtual screening of KRAS G12D inhibitors, taking into account preliminary assessments of oral bioavailability and toxicity. The developed QSAR models (Table 1), as well as the algorithms for their construction, statistical characteristics in the form of Jupyter Notebook program files, and the results of molecular docking, are freely available at the GitHub repository https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2 (accessed on 10 December 2025).
3. Methods and Materials
3.1. Dataset of KRAS G12D Inhibitors
For QSAR modeling, we extracted a dataset of compounds with experimentally reported IC_50_ values against KRAS G12D from publication [40]. In accordance with established recommendations for data curation in cheminformatics [33], we verified and preprocessed the initial dataset, consisting of 645 compounds and publicly available at https://zenodo.org/records/11137638 (accessed on 10 December 2025). Mixtures, compounds with incorrect or inconsistent chemical structures, and salts were excluded from further consideration. During the curation process, nine structurally invalid entries and four salt forms were identified and removed. The original Python code used for dataset validation is publicly available at https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2/blob/main/DATA%20curation%20and%20cleaning.ipynb (accessed on 10 December 2025).
Since the objective of this study was to construct regression QSAR models capable of predicting IC_50_ values toward KRAS G12D, only entries containing explicit numerical IC_50_ values were retained; records with non-numeric values containing “>” or “<“ qualifiers were excluded. Experimental bioactivity values expressed in molar IC_50_ were converted into their negative decimal logarithm (pIC_50_), which is conventionally used in QSAR studies due to its improved linearity with respect to biological response.
For compounds having two or more reported experimental pIC_50_ measurements, we calculated the mean and standard deviation. We retained only entries with a standard deviation not exceeding 0.5 log units, in accordance with a previously described procedure for handling duplicate measurements [41].
To evaluate the predictive performance of the QSAR models, we divided the initial dataset comprising 566 compounds into training (ws) and test (ts) sets. The final sorted dataset was ordered by increasing pIC_50_, and every fifth compound was assigned to the test set, while the remaining entries formed the training set. As a result, the training set and the test set comprised 452 and 114 compounds, respectively. We exported both subsets into SDF format using RDKit v 2025.03.6 in Python v3.11; they are publicly available at https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2/tree/main/datasets (accessed on 10 December 2025).
The distribution of activity values in the training and test sets is shown in Figure 8A. As can be seen, the two sets exhibit similar ranges and frequencies of experimental pIC_50_ values. The investigated compounds cover a sufficiently broad pIC_50_ range of more than five logarithmic units, which has a positive effect on the descriptive and predictive performance of the developed QSAR models. As previously demonstrated in [42], an adequate QSAR model requires an activity range of at least one logarithmic unit.
Visualization of the chemical space of the training and test sets in the molecular weight (MW)—lipophilicity (LogP) coordinate system is presented in Figure 8B. Analysis of this plot indicates a sufficiently high level of chemical diversity in both subsets, as evidenced by the wide ranges of molecular weight and lipophilicity of the investigated compounds.
In order to further assess the predictive power of the developed models, a second external test set was formed by aggregating compounds with experimental pIC_50_ values for KRAS G12D from the ChEMBL [35], BindingDB [43] databases, and a study [44] devoted to the development of KRAS G12D inhibitors. When collecting data, the data verification and preprocessing methodology described above was applied. The total volume of the second independent test sample was 1266 compounds.
3.2. Development of QSAR Models
Molecular structures were described using ECFP4 (radius = 2, nBits = 1024, useFeatures = False, useChirality = False), MACCS (166-bit structural keys), Klekota-Roth, PubChem descriptors, Topological Torsion, Atom Pairs, and Topological Path-Based fingerprints. Descriptor calculation was performed using the RDKit v2025.03.6 [45] and PaDEL-Descriptor v0.1.11 (PadelPy) [46,47] libraries in Python v3.11. We built QSAR models using the scikit-learn [48] and CatBoost v1.2.6 [49] libraries, applying gradient boosting (CatBoost), support vector machines (SVM), and fully connected neural networks with a multilayer perceptron (MLP) architecture. Hyperparameters of the models were automatically optimized using GridSearchCV implemented in scikit-learn. Both the model hyperparameters and the implementation scripts in Jupyter Notebook format are publicly available at https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2 and can be utilized for QSAR modeling of other types of biological activity.
To assess the robustness of the models, five-fold internal cross-validation (5-fold CV) was performed [50]. The inclusion of test set compounds within the applicability domain (AD) was evaluated using the similarity distance approach [51]. A test compound is considered to belong to the QSAR model’s applicability domain if its similarity distance does not exceed the threshold value D_c_, calculated using Equation (1):
where and σ are the mean and standard deviation of the Euclidean distances in the descriptor chemical space between all compounds in the training set and their nearest neighbors; is a constant, typically set to 0.5.
Data coverage (Cov) within the applicability domain was calculated as the ratio of the number of test set compounds falling within the AD to the total number of compounds in the test set. The predictive performance of the QSAR models was evaluated using the coefficient of determination (Q^2^):
And the root mean square error (RMSE):
where is the observed activity of the i-th compound, is the predicted activity of the i-th compound, is the mean observed activity, and m is the number of compounds in the dataset. We calculated the Q^2^ and RMSE values using the scikit-learn library.
To improve predictive power, we developed a consensus model. The predicted consensus activity value was calculated as the average of the predictions from the three best QSAR models.
For validation of adequate QSAR models, the y-randomization method with 50 iterations was applied using the permutation_test_score module in the scikit-learn library. The absence of random correlation in the QSAR models was estimated using the Q2_rand determination coefficient and p-value. If a p-value is sufficiently small, usually below a certain threshold (e.g., 0.05), there is no random correlation in the relationship described by a model [52].
During structural interpretation, we evaluated the contribution of descriptors to the QSAR models using the SHAP library in Python v3.11 [53]. We assessed feature importance in SHAP v0.44.0 based on Shapley values, which quantify the contribution of each descriptor to the model predictions.
For a comparative analysis of the predictive performance of the developed QSAR models of KRAS inhibitors, we additionally employed the OCHEM web platform (https://ochem.eu) (accessed on 10 December 2025), which provides various sets of molecular descriptors and machine learning algorithms. The OCHEM platform integrates several software packages for calculating a wide range of descriptor sets. In this study, we used eleven descriptor sets for QSAR model development, including OEState [54] combined with AlogPS [55], CDK [56], Dragon v7 [57], QNPR descriptors [58], Extended Connectivity Fingerprint 4 (ECFP4) [59], alvaDesc [60], Fragmentor (length: 2–4) [61], MOLD2 [62], and MORDRED [63]. QSAR models were developed using various machine learning methods, including Associative Neural Networks (ASNN) [55], Random Forest, Gradient Boosting implemented in the XGBoost library [64], Deep Neural Networks (DNN) [65], k-Nearest Neighbors (KNN) [66], and Multiple Linear Regression Analysis (MLRA) [67]. In addition, several deep learning approaches were applied: Transformer Convolutional Neural Network (TRANSNNI) [68], Attentive FP [69], and Chemprop [70]. The models were built using optimized parameter settings for each machine learning method provided by the OCHEM platform. The applicability domain was assessed using the “distance-to-model” concept, specifically the “BAGGING-STD” approach described in the OCHEM user manual [71].
3.3. Framework KRASAVA
We implemented the consensus QSAR model in KRASAVA (https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2/blob/main/KRASAVA%20v2.ipynb) (accessed on 10 December 2025), a Python-based framework realized as a reproducible Jupyter Notebook, executable via Google Colab. The KRASAVA framework was created in Python v3.11 and utilizes the RDKit v2025.03.6 [45], scikit-learn v1.3.1 [48], NumPy v1.22.1 [72], and Pandas v1.3.5 [73] libraries.
3.4. Molecular Docking
We used GNINA v1.3 [74] because its machine-learning-based scoring functions provide superior performance compared to Vina. Although Gnina follows the same docking workflow as Vina, it achieves higher accuracy by rescoring ligand poses with convolutional neural network scoring functions after the initial Vina scoring. This additional ML-based rescoring step improves the reliability of binding pose predictions. Gnina demonstrated strong performance in practical applications [75] and independent benchmark studies [76], further supporting its reliability.
To validate the selected docking protocol, a re-docking procedure was conducted, in which the best docking pose of the ligand was superimposed onto the co-crystallized ligand conformer, and the root-mean-square deviation (RMSD) was measured. According to [77], the generally accepted RMSD threshold for re-docking, given the chosen preprocessing and docking protocol, should be ≤2 Å.
For molecular docking of potential KRAS G12D inhibitors, the 7T47 protein model (resolution: 1.27 Å, single chain A) was selected from the RCSB Protein Data Bank (https://www.rcsb.org/) (accessed on 10 December 2025), as it contains a co-crystallized KRAS G12D inhibitor MRTX1133 (PubChem CID: 162369732, CHEMBL5081048, PDB Chemical Component ID: 6IC), which served as a reference compound [11].
Ligand and protein preprocessing were performed in Chimera [78] using the Dock Prep module. Ligand preprocessing included protonation and geometry optimization using the GAFF2/AM1-BCC force field. During the preprocessing of the molecular target, the native 7T47 protein model was cleaned of co-crystallized ligands (GCP, GDP, glycerol, and acetate ion), water molecules, and the magnesium ion. Chain A was protonated at pH 7.4, partial charges were added, and missing amino acid residues were restored.
In GNINA, the binding site was defined based on the position of MRTX1133 (−23.00 Å; 5.14 Å; 23.02 Å) using the autobox_ligand option with default parameters (exhaustiveness = 16, num_modes = 9) and a 4 Å margin in all directions.
During the re-docking of MRTX1133, the RMSD value was 0.76 Å (Figure 9), indicating that the applied docking protocol successfully reproduces ligand conformations consistent with crystallographic data. The RMSD value was calculated using the mcs_rmsd function implemented in the useful_rdkit_utils library v 0.93 [79].
Molecular docking results were visualized using the ProteinsPlus web application [80]. All docking data are publicly available at https://github.com/ovttiras/QSAR_KRAS_inhibitors_v2/tree/main/Docking (accessed on 10 December 2025).
4. Conclusions
The conducted study enables the following outcomes:
- Development of a series of regression QSAR models for KRAS G12D inhibitors using ECFP4, Klekota-Roth, PubChem, MACCS, Topological Torsion, Atom Pairs, Topological Path-Based fingerprints, and RDKit descriptors, along with CatBoost, SVM, and MLP algorithms, as well as models developed via the OCHEM platform;
- Structural interpretation of QSAR models for KRAS G12D inhibitors, identifying the most significant fragments and molecular transformations;
- Integration of the consensus QSAR model into the KRASAVA framework, enabling retrieval of experimental data for investigated compounds, as well as virtual screening of potential KRAS G12D inhibitors with preliminary assessment of bioavailability through Muegge’s rules compliance, and evaluation of acute toxicity via identification of key toxicophores and Brenk filters;
- Rational molecular design of compounds based on the structural interpretation results and capabilities of the KRASAVA framework, leading to the proposal of two most promising KRAS G12D inhibitors;
- Comparative analysis of the proposed compounds through molecular docking, examining the nature of their interactions with the KRAS G12D binding site, and validating the results obtained from QSAR structural interpretation.
The results of this study are expected to reduce financial, temporal, and labor costs associated with the synthesis and testing of new KRAS G12D inhibitor drugs. We consider experimental validation an important direction for future studies.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Bannoura S.F. Khan H.Y. Azmi A.S. KRAS G 12D Targeted Therapies for Pancreatic Cancer: Has the Fortress Been Conquered?Front. Oncol.202212101390210.3389/fonc.2022.101390236531078 PMC 9749787 · doi ↗ · pubmed ↗
- 2Zhu G. Pei L. Xia H. Tang Q. Bi F. Role of Oncogenic KRAS in the Prognosis, Diagnosis and Treatment of Colorectal Cancer Mol. Cancer 20212014310.1186/s 12943-021-01441-434742312 PMC 8571891 · doi ↗ · pubmed ↗
- 3Zeissig M.N. Ashwood L.M. Kondrashova O. Sutherland K.D. Next Batter up! Targeting Cancers with KRAS-G 12D Mutations Trends Cancer 2023995596710.1016/j.trecan.2023.07.01037591766 · doi ↗ · pubmed ↗
- 4Cox A.D. Der C.J. Ras History Small GT Pases 2010122710.4161/sgtp.1.1.1217821686117 PMC 3109476 · doi ↗ · pubmed ↗
- 5Prior I.A. Lewis P.D. Mattos C. A Comprehensive Survey of Ras Mutations in Cancer Cancer Res.2012722457246710.1158/0008-5472.CAN-11-261222589270 PMC 3354961 · doi ↗ · pubmed ↗
- 6Ryan M.B. Corcoran R.B. Therapeutic Strategies to Target RAS-Mutant Cancers Nat. Rev. Clin. Oncol.20181570972010.1038/s 41571-018-0105-030275515 · doi ↗ · pubmed ↗
- 7Li Y. Yang L. Li X. Zhang X. Inhibition of GT Pase KRASG 12D: A Review of Patent Literature Expert Opin. Ther. Pat.20243470172110.1080/13543776.2024.236963038884569 · doi ↗ · pubmed ↗
- 8Muñoz-Maldonado C. Zimmer Y. MedováM. A Comparative Analysis of Individual RAS Mutations in Cancer Biology Front. Oncol.20199108810.3389/fonc.2019.0108831681616 PMC 6813200 · doi ↗ · pubmed ↗
