A Data-Driven Approach for Interpretable and Efficient Predictive Modeling: A Case Study in SARS-CoV-2 Protease Inhibitor Discovery Through Feature Selection
Branislav Stanković, Sang-Yong Oh, Dušan Ramljak

TL;DR
This paper introduces a reliable and efficient method for drug discovery, specifically for finding inhibitors of the SARS-CoV-2 protease, using interpretable predictive models.
Contribution
The novel contribution is a validated framework combining FeatureWiz and stepwise selection for interpretable and efficient predictive modeling in drug discovery.
Findings
Combining FeatureWiz with stepwise selection satisfies all evaluation criteria for chemoinformatic models.
Two-dimensional descriptors with OLS regression achieved the best predictive performance.
The framework provides transparent and computationally efficient models for biological activity prediction.
Abstract
Background/Objectives: Feature selection approaches should satisfy all evaluation criteria required by state-of-the-art chemoinformatic models. Our aim is to develop a methodology that is robust, interpretable and computationally efficient. Methods: This study presents a robust methodology for developing highly interpretable and computationally efficient predictive models, with a specific application in the discovery of SARS-CoV-2 main protease inhibitors. We evaluated various descriptor selection procedures to identify a transparent and reproducible approach that provides actionable insights for data-driven decisions. The models were trained and tested using molecules from the CHEMBL database and further validated on an external set of compounds. Results: Our findings demonstrate that a recently proposed procedure, combining the FeatureWiz algorithm with stepwise feature selection, is…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8
Figure 9
Figure 10
Figure 11
Figure 12- —Penn State Great Valley
- —Ministry of Science, Technological Development and Innovation of the Republic of Serbia
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Drug Discovery Methods · Machine Learning in Bioinformatics · vaccines and immunoinformatics approaches
1. Introduction
The SARS-CoV-2 pandemic has had a profound global impact, affecting almost 800 million people and causing more than 7 million deaths by November 2025 [1]. Even with the widespread use of vaccines, the ongoing emergence of new cases and deaths highlights the ongoing need for effective therapeutic drugs. Various pharmaceutical classes have been tested as drugs for COVID-19, including antibacterials, antimalarials, antivirals, immunomodulators, and others [2,3]. Concerns persist regarding vaccine safety for individuals with allergies, immune disorders, or those who are pregnant, and their efficacy against evolving coronavirus variants remains uncertain.
A primary target for pharmaceutical interventions is the coronavirus 3-chymotrypsin-like protease ( or ). This key protease, composed of approximately 306 amino acids, is essential for viral replication and transcription. Its substrate-recognition pocket is highly conserved across all coronaviruses and could help to find a drug that would work in the event of a new pandemic. It features key residues, including the nucleophilic sulfur of Cys 145 and the imidazole ring of His 41, which are crucial for its enzymatic activity. Therefore, disrupting the function of these residues is an effective strategy for inhibiting [4,5].
Given the challenges and high costs of traditional drug discovery, computational approaches have become indispensable. Quantitative Structure–Activity Relationship (QSAR) modeling allows researchers to predict biological activity directly from molecular structure, enabling faster and more efficient identification of promising compounds. In drug repurposing, QSAR is particularly valuable as it can reveal new therapeutic potentials for existing molecules, thereby reducing the development time and the associated cost.
Data-driven methods leveraging machine learning (ML) have greatly enhanced the potential of QSAR modeling by enabling the efficient handling of large chemical datasets and the extraction of complex, non-linear patterns between structure and activity. This integration makes QSAR a powerful tool for predicting biological activity, guiding drug discovery, and supporting repurposing efforts. Using ML, QSAR can provide faster, more accurate, and generalizable insights compared to traditional approaches.
This study introduces a robust, highly interpretable, and computationally efficient methodology for developing predictive models in drug discovery, specifically demonstrated through the identification of potent SARS-CoV-2 main protease inhibitors.
Our primary contribution is the development and validation of a transparent and reproducible approach. We rigorously evaluated various descriptor selection techniques and demonstrated that a combined approach using the FeatureWiz algorithm followed by stepwise selection yields superior performance compared to the benchmarks from the literature. In addition to providing actionable insights for predictive models, it is the only approach that satisfies all evaluation criteria required by state-of-the-art chemoinformatic models.
We successfully trained and tested a model using molecules from the CHEMBL database, and externally validated the model using an external set of compounds. External set is added to ensure generalizability. Our key finding is that models built on two-dimensional molecular descriptors and the Ordinary Least Squares (OLS) regression method achieved the best overall results. These specific methodological choices constitute a significant contribution by providing inherent interpretability and high computational efficiency, which are critical factors for rapid, data-driven decisions in a drug discovery setting.
Ultimately, the models derived from this validated framework serve as effective, transparent tools for the rapid and reliable prediction of biological activity, establishing a crucial foundation for data-driven decisions in the fight against COVID-19 and accelerating other therapeutic programs. A key practical outcome of this work is the proposal to integrate these models into high-throughput screening (HTS) processes.
The remainder of the paper is organized as follows. In Section Interpretable and Efficient Predictive Modeling Background we explore the relevant prior research on interpretable and efficient predictive modeling. The results and their implications are discussed in Section 2. The dataset, design, and implementation of our approach are presented in Section 3. Finally, Section 4 provides concluding remarks and outlines directions for future research.
Interpretable and Efficient Predictive Modeling Background
Quantitative structure–activity relationship modeling has long been recognized as a cornerstone of computational drug discovery, operating on the principle that the biological activity of a compound is fundamentally influenced by its chemical structure. This structural–activity paradigm has been widely applied to , a highly conserved viral enzyme that is essential for the SARS-CoV-2 replication, and therefore a critical target for therapeutic intervention [6,7,8,9,10,11,12,13,14,15,16]. Several QSAR studies on inhibitors reveal that branching within molecular scaffolds generally reduces inhibitory activity [6,7,8]. Authors in [14] demonstrated that electron-donating substituents capable of hydrogen bonding enhance activity, while authors in [15] showed that the spatial distribution of electronegativity strengthens intermolecular polar interactions and minimizes intramolecular effects. Together, these findings underscore the importance of electronic and structural properties in shaping the inhibitory potential.
Heteroatoms, particularly nitrogen and sulfur, have been highlighted as key determinants of inhibitory potency in multiple studies. Authors in [6] reported that the pyridine and thiophene rings enhance activity, whereas imide and thioimide groups reduce it. Authors in [14] further identified the pyrrolidine and imidazole moieties as contributing to strong inhibition. Authors in [10] found that single nitrogen-containing rings, such as pyridine and piperidine, are favored over multi-heteroatom scaffolds such as pyridazine, pyrimidine, thiazole, and pyrazole. In contrast, secondary amides have been repeatedly associated with a decrease in inhibitory potential [7,8,10]. Beyond these structure–activity trends, authors in [16] incorporated quantum chemical descriptors such as the lowest unoccupied molecular orbital ( ) and polarizability ( ), which are well-established in enzyme–substrate interaction studies [17,18,19,20]. These results highlight the diversity of molecular features that can inform predictive modeling but also reveal the challenges of consistently selecting descriptors that balance predictive accuracy and mechanistic interpretability.
The selection of descriptors remains a pivotal step in QSAR modeling because descriptors serve as the quantitative link between chemical structure and biological activity. Traditional strategies—such as genetic algorithms, stepwise regression, and backward elimination—have been extensively used to identify informative descriptors [6,7,8,9,10,11,12,13,14,15,16]. However, these approaches can be sensitive to multicollinearity and data dimensionality, limiting their reproducibility. Recent work has proposed transparent, reproducible selection methods, including the integration of novel algorithms such as FeatureWiz with conventional stepwise procedures [21]. These techniques not only address limitations that previous models exhibited, but also emphasize interpretability, which is a priority in modern data-driven drug discovery where decision-making requires both reliability and clarity.
The modeling methodology is equally decisive in shaping the utility of QSAR frameworks. Although 3D descriptors and nonlinear learning algorithms can capture subtle structural effects, they often produce models that are computationally expensive and difficult to interpret for practitioners. By contrast, 2D descriptors provide a tractable balance between simplicity and informativeness, allowing for rapid, interpretable predictions that are more easily adopted in pharmaceutical pipelines [22]. Similarly, regression methods such as OLS remain highly valued for their transparency. Unlike complex black-box approaches, OLS coefficients directly map molecular features to activity, providing decision-makers with actionable insights into structural modifications. The computational efficiency of OLS further enhances its suitability for large-scale screening and real-time prioritization in drug discovery pipelines [23,24,25].
More broadly, the value of simplicity aligns with the principle of Occam’s Razor widely used in ML, advocating that models should be as simple as possible while still capturing essential patterns. Transparent and efficient models foster trust, reproducibility, and regulatory acceptance, which are essential for translational impact in fields like antiviral drug discovery. Indeed, several studies demonstrate that parsimonious models can achieve predictive performance comparable to or even exceeding that of more complex architectures [26,27,28]. Together, this literature underscores the urgent need for QSAR frameworks that integrate robust descriptor selection with transparent modeling, allowing actionable insights that are both statistically reliable and practically relevant to accelerate COVID-19 therapeutic development.
2. Results and Discussion
2.1. Evaluation of Models
Assessing the predictive capacity of the developed models is based on understanding the distribution of molecules in the chemical space. The SlogP vs. MW space is a traditional representation widely used in cheminformatics to visualize the distribution of compounds in a simplified physicochemical space. It does not reflect the full structural space defined by the molecular descriptors used for model development. Thus, this representation is included only to illustrate how molecules are positioned in a commonly used chemical space. As shown in Figure 1, the molecules in the training, test, and external sets exhibit a diverse chemical profile. The training and test sets occupy similar regions, which is essential for consistent model applicability. The external dataset covers a slightly broader chemical space, confirming its suitability for validating the model’s ability to predict the activity of novel compounds.
2.2. Model Selection and Validation
In the context of data-driven decision-making, the choice of molecular descriptors is a critical first step. Although 3D descriptors can offer detailed information, their calculation is time-consuming and can be prone to inaccuracies, making them less suitable for rapid large-scale analysis. The majority of literature on QSAR models for inhibitors focuses on 2D descriptors, which are simpler and more directly related to chemical constitution. The computational efficiency and greater interpretability of 2D descriptors are a significant advantage for decision-makers who need to quickly understand the structural features driving a prediction. For this reason, our study utilized 1613 2D molecular descriptors from the Mordred package version 1.2.0.
Quantum-chemical descriptors are inherently interpretable and can readily guide (pharmaceutical) chemists in rational compound design. Furthermore, although they are traditionally considered computationally expensive, recent advances in machine learning allow their accurate estimation at significantly reduced computational cost. Therefore, two quantum-chemical descriptors from a previous study of [16] were added to the 1613 2D molecular descriptors from the Mordred package version 1.2.0. From the initial set of descriptors, 969 descriptors remained after excluding those with missing or non-numerical values.
The choice of regression model plays a crucial role in data-driven decision-making. Although complex “black box” models can sometimes offer marginal improvements in predictive accuracy, interpretability often takes precedence when clear insights and transparency are essential. Models that reveal how descriptors influence activity are particularly valuable in guiding practical applications.
The case when OLS performs well suggests that a strong, simple linear relationship underlies the data, making the more interpretable model the preferred choice for this application. OLS coefficients directly indicate how each descriptor relates to activity, offering explainability by design. For example, a positive coefficient immediately signals that increasing the corresponding descriptor tends to increase predicted activity—providing drug designers with straightforward and actionable insights.
To generate the results, a genetic algorithm was applied to construct 500 models; however, none fulfilled the QUIK rule test or the selection criteria outlined in Section 3.5. Similarly, none of the five models developed using SequentialFeatureSelector, RFE, and SelectKBest algorithms passed the QUIK rule test, suggesting the presence of multicollinearity in all cases. In contrast, models obtained through RFE, SequentialFeatureSelector with forward selection, and SelectKBest with the f_regression function (Equations (A1), (A2) and (A3), respectively) satisfied all metric-based evaluation criteria (Table A1 and Table A2).
Our descriptor selection process demonstrated that the FeatureWiz algorithm followed by stepwise feature selection was the most effective method. This procedure not only streamlines the modeling process but also helps to mitigate multicollinearity, ensuring that the selected descriptors provide unique and meaningful insights. Neither nor was selected, suggesting that these descriptors may only be effective for the specific class of molecules investigated by [16] or that they require a more precise, time-consuming method for calculation when applied to larger datasets.
We present three selected models. Model 1 (Equation (1)) and Model 2 (Equation (2)) were developed using a correlation threshold of , while Model 3 (Equation (3)) was based on a threshold of . In particular, Model 2 incorporates 10 of the 14 descriptors included in Model 1, raising questions regarding the influence of the descriptor count on model performance.
The metrics for these models are summarized in Table 1 and Table 2. The consistently high F-values indicate that the selected descriptors collectively explain the variation in activity beyond chance. All models met the predefined criteria for robustness and predictability.
All models listed in Table 1 obey the QUIK rule, with no descriptors having a VIF higher than 5, indicating minimal multicollinearity. Moreover, the low values of coefficients of determination for the training set, , and leave-one-out cross-validation, , calculated in Y-scrambling suggest the absence of chance correlation, further validating the reliability of the models for predicting inhibitory activity against the SARS-CoV-2 main protease.
The results demonstrate overall comparability across all models. Model 2 shows only slight variation compared to Model 1 (making it a good alternative), while Model 3 exhibits more significant differences, particularly in parameters related to prediction accuracy. These variations could impact model performance during external validation. It is worth noting that the models which we decided not to focus on and thus represented in the Appendix A by Equations (A1)–(A3) also exhibit slightly worse performance compared to Model 1 (Table A1 and Table A2 in the Experimental Section of Appendix A). Good performances in cross-validation experiments, as well as similar performances of the model when trained on the complete dataset, can indicate good stability of a model.
To thoroughly investigate the performances of Models 1–3, scatter plots comparing experimental and predicted activity values for the training and test sets (Figure 2) were analyzed, along with the Williams plots (Figure 3). Williams plots are primarily designed to define and visualize the applicability domain of a QSAR/ML model, rather than to directly assess the quality of a train/test split. In principle, they may indicate a problematic split in cases where the training set compounds are clustered in a narrow region around the centroid of the descriptor space, while test set compounds occupy more extreme regions (i.e., show high leverage values). Since leverage reflects the distance from the centroid of the training set in descriptor space, a predominance of high-leverage test compounds could suggest structural imbalance and potential extrapolation. In our case, both training and test compounds occupy the same descriptor space region and are well within the defined applicability domain. The leverage values do not indicate structural extremity of the test compounds relative to the training set. Furthermore, the Williams plot is inherently constructed based on the selected training set and descriptor combination, which naturally centers the training compounds in the defined space. Taken together, the leverage analysis does not suggest structural imbalance between the training and test sets. This conclusion is further supported by the PCA analysis, which shows comparable distribution patterns for both sets in the descriptor space and will be presented separately. Thus, it is evident that all models demonstrate strong fitness and predictability. Models 1 and 2 show similar results, showing more accurate predictions for molecules with higher values compared to Model 3 (Figure 2). Since one of the goals of the drug discovery process is to develop drugs with high values, this represents a particular advantage.
Additionally, one molecule shows a slightly higher positive standard deviation (Figure 3c), although it does not surpass the critical value of 3 (i.e., no Y-outliers are detected). Moreover, all compounds in both the training and test set fall below the lowest threshold of (for Model 2, which has the lowest number of descriptors), indicating the absence of response outliers and suggesting that predictions of inhibitory activity against the SARS-CoV-2 main protease could be extrapolated by all three models (Figure 3).
To further ensure that the model does not have multicollinearity and overfitting issues and to confirm that all descriptors within the model are relevant, several other linear regression methods were performed. Even though other models are not strictly necessary to present a novel data-driven methodology for feature selection, their comparison with Models 1–3 strengthens its validation, as it demonstrates its advantages relative to alternative model configurations. Thus, we present Table A3 and Table A4 in the Experimental Section of Appendix A. As can be seen, performance decreases in the following order: OLS → Ridge → LassoLars → Bayesian Ridge → ARD → Lasso → Linear SVR. Since the OLS method unequivocally gives the best results and Model 1 and Model 2 show better performance than Model 3, in the remainder of the study, the focus of this research will be on the results obtained by Model 1 and Model 2. The fact that OLS performed optimally suggests that a strong, simple linear relationship underlies the data, making the more interpretable model the preferred choice for this application.
Given the relatively small size of the CHEMBL database dataset, questions may arise regarding the model’s capability to predict values for compounds with more structurally diverse profiles. To address this, values from various QSAR studies in the literature [6,7,8,9,10,11,12,13,14,15,16] were compiled to form an external dataset of molecules not present in the CHEMBL database. From the plot of the second versus the first principal component calculated on Model 1 descriptors, it can be observed that, similar to the previously discussed SlogP versus MW plot, molecules from the external dataset occupy a slightly broader chemical space than molecules from the CHEMBL database (Figure 4a), justifying its use. As a side note, it is convenient to notice that PCA divides the datasets into two distinctive subsets. The smaller subset consists of sulfur compounds and aromatic ketones, while the remaining molecules are in the larger subset.
The ability of Model 1 and Model 2 to predict values of molecules from the external dataset was tested in two different ways: by using original models (results marked as 1e and 2e in Table 2) and using models trained on the full ChEMBL dataset (results marked as 1f and 2f in Table 2). As can be seen, all metrics show comparable values to those obtained on the internal test set. This is an excellent result given that, due to the heterogeneous sources of the external set chemicals, the higher variability in the results should be expected. In Figure 4b, a scatter plot comparing predicted versus experimental values for the training and external sets shows that even the two molecules from the external set with higher values than those in the training set are accurately predicted. Additionally, as observed from the Williams plot, no outliers are detected in the external set (Figure 4c). Therefore, there is a robust foundation to assert that the proposed model is suitable for predicting inhibitory activity within a broad chemical space (see Figure 5) and that consequently descriptor selection methodology can extract useful information from limited data. It is worth mentioning that, although Figure 4b,c present results of original Model 1, plots obtained by Model 2, as well as adequate models trained on the full ChEMBL dataset, do not differ significantly.
Due to the non-uniform distribution of data points, with a dense representation within the 4.5 to 5.5 range and sparse representation outside of it, the models’ ability to predict values in these two sparse regions was tested. The models demonstrated similar performance to those presented in Table 1 and Table 2, confirming that they are not biased towards the majority range.
2.3. Comparison with Models Found in the Literature
To the best of our knowledge, Authors in [9] developed the only model existing in the literature using the dataset from the CHEMBL database. However, as shown in Table 3 detailing the performances of models found in the literature, this model is characterized by unsatisfactory statistics in external validation. Similarly, the models developed by [6,7,8,12,16] exhibit low values of and/or . Furthermore, the model of [13] does not meet accuracy requirements due to a low value of , while the model of [10] has this shortcoming along with a low value of and a high value of . Additionally, the model of [14] fails to meet the criterion .
Therefore, among the models from the literature, only those by [11,15] are more closely aligned with the requirements of state-of-the-art QSAR modelling. However, ref. [15] did not calculate all relevant statistics to ensure that. Additionally, comparing these models directly with this study is challenging for several reasons. While 2D descriptors were employed in this study, the aforementioned research utilized 3D descriptors, which are not only more computationally intensive—requiring precise atom coordinates—but also often less intuitive. As discussed, this may result in less precise insights into the design of drug candidates compared to 2D descriptors. In addition, both models were developed for specific classes of inhibitors—ketone-based covalent inhibitors [11] and unsymmetrical aromatic disulfides [15]. Nevertheless, since the methodology presented in this study demonstrates comparable internal fitting parameters and superior external validation performance compared to that of [15], it can be inferred that it is better suited to predict the inhibitory activity of novel compounds. On the other side, while the model of [11] demonstrates better performance in 6 out of 9 parameters as shown in Table 3, once again, it is important to note that their model is specifically trained and tested on 29 derivatives of a certain compound, making its applicability limited to a narrow chemical space.
2.4. Mechanistic Interpretation
Mechanistic interpretation of QSAR models is an essential part of modelling because it provides insights into the biological or chemical processes underlying compound activity, thus enhancing scientific understanding and model validation. In other words, it ensures the model’s predictions are grounded in known principles, improving predictive power and reliability. In drug design, mechanistic insights guide the creation of effective and safe molecules, while also meeting regulatory requirements [29]. To better understand the mechanisms interpretation of Models 1 and 2, Table 4 provides the physical meaning of the corresponding descriptors. Additionally, Table 5 shows six molecules from the CHEMBL dataset (three with low values and three with high values), along with the values of all descriptors. This comparison highlights the differences in descriptor values between molecules with varying activity levels, further clarifying the relationship between molecular structure and inhibitory activity.
By reflecting its sensitivity to differences in atomic numbers among atoms at a lag of 7, the AATS7Z descriptor shows a positive correlation with activity. It favours molecules containing heavy atoms (such as Br, S, I, and Cl) and aromatic rings. These features can increase inhibitor activity by enhancing hydrophobicity and polarizability, which promotes intermolecular interactions over intramolecular interactions. EState_VSA9 descriptor is part of the EState family of descriptors and serves to quantify a specific aspect of Van der Waals surface area contributions based on electrotopological state considerations. In analysed datasets, molecules with high values of EState_VSA9 typically contain chlorine atoms (Table 5), which supports the frequent use of chlorine and other halogens as substituents in drugs, enhancing ligand–protein interactions via halogen bonds [30].
Molecules with higher values of GATS5c exhibit charge distribution over longer distances, allowing for more efficient interactions with amino residues. In contrast, molecules with lower values of GATS5c have a dense distribution of charged atoms (Table 5), resulting in lower hydrophobicity and reduced activity. Since oxygen and nitrogen atoms tend to have relatively low Gasteiger charges, molecules containing keto and ether groups, as well as nitrogen-containing rings, are favoured. Conversely, the last two molecules listed in Table 5 contain two nitrogen atoms from a pyrimidine ring at a topological distance of 5 from nitrogen in a secondary amide group, resulting in a low value for this descriptor. GATS5c was also identified in a study on 2,5-disubstituted furans as antimalarial drugs [31]. This adds to its importance, given that antimalarials have demonstrated in vitro efficacy against SARS-CoV-2.
The GATS2 descriptor has small values when nitrogen, sulfur, and oxygen atoms are at a topological distance of 2, i.e., for molecules containing (thio)amide groups and multiple heteroatom-containing rings. Similarly, the SssNH descriptor summarizes electrotopological states of nitrogen atoms in the form of –NH–, thus having higher values for secondary amides, which are known to exhibit low inhibitor activity [7,8,10].
Selection of n10FaRing is not surprising, given that molecules with large surface area, hydrophobic terminal groups, and stable aromatic substituents often show higher activities [9,11]. NaaaC is included in the model developed by [13], as well as in research on drugs with inhibitor activity against Alzheimer’s disease [32]. This descriptor has been associated with various interactions such as H-bonding, salt bridges, alkyl groups, and -sigma, -cation, and -alkyl interactions.
naHRing shows a weak positive correlation with values. While this might be surprising for any descriptor with integer values, naHRing has demonstrated particular significance in studies of activin receptor type-5 kinase inhibitors [33] and cannot be omitted from Models 1 and 2 without significantly affecting performance. Additionally, the majority of research highlights the significance of heteroatomic rings [6,10,14].
The descriptor StsC is related to the existence of a carbon atom with one triple and one single bond, i.e., cyano groups. According to findings by [14], inhibitor activity increases with the electron-donating ability of substituents. Therefore, since the cyano group is electron-withdrawing, the negative sign of the coefficient for StsC can be easily understood. Furthermore, NssssC has been selected, representing the number of quaternary carbon atoms. As said, it is known that branching decreases inhibitory activity [6,7,8]. In addition, this descriptor is included in the study of [8].
The PEOE_VSA10 descriptor, part of the partial equalization of orbital electronegativity descriptors, quantifies the Van der Waals surface area contribution from specific molecular segments. In other words, this descriptor reflects Van der Waals interactions associated with the partial charges of certain atoms or functional groups within the molecule. In analysed datasets, molecules featuring pyrimidine, thiazole, and pyrazole rings exhibit high PEOE_VSA10 values (Table 5). As said, the literature suggests that molecules with multiple heteroatom-containing rings tend to show lower inhibitory activity [10]. Notably, PEOE_VSA10 was selected in the study of HIV-1 protease inhibitors [34], some of which, like lopinavir/ritonavir, were proposed for SARS-CoV-2 treatment [2,3]. Similarly, the PEOE_VSA6 descriptor shows high values in molecules containing pyrimidine rings, secondary amides, and aromatic ketones. As mentioned, secondary amides are known for their lower activity, while aromatic ketones’ low inhibitory activity may be due to their keto groups’ poor electron donation capability. Additionally, the EState_VSA4 and SMR_VSA6 descriptors are selected, representing the electrotopological state and molar refractivity contribution of specific atoms or functional groups to the Van der Waals surface area. In analysed datasets, high EState_VSA4 values are observed in molecules containing pyrimidine rings and secondary amides, while SMR_VSA6 values peak when a S atom is bonded to the C2 atom of a pyrimidine ring and in molecules with chlorine. Given the similarity of molecules with high values of these descriptors, a detailed analysis is warranted to understand better their nuances (which is currently out of the scope of this study). However, PEOE_VSA6 and EState_VSA4 are only part of Model 1, which, as discussed, exhibits slightly better performance than Model 2, highlighting that these descriptors might be affecting the performance of molecular activity prediction.
A closer analysis of the dataset and the mechanistic interpretation of descriptors suggest that molecules with high inhibitory activity against SARS-CoV-2 main protease generally lack cyano and secondary amide groups, as well as quaternary carbons (i.e., StsC, NssssC, and SssNH descriptors have values 0). Eight of the top ten most active molecules exhibit relatively high AATS7Z values (associated with heavy atoms), and four of the top five most potent inhibitors contain chlorine, contributing to high AATS7Z and EState_VSA9 values. To the best of our knowledge, our approach resulted in models that are the first QSAR models to highlight these significant features. Additionally, the models favor aromatic rings (as represented by descriptors like n10FaRing, naHRing, and NaaaC), especially those with heteroatoms. However, this is under specific constraints, particularly from descriptors such as GATS5c and GATS2are, along with PEOE_VSA10, PEOE_VSA6, SMR_VSA6, and EState_VSA4. In summary, molecules featuring larger aromatic structures, chlorine or other heavy atoms, and multiple nitrogen-containing rings appear to be prominent inhibitors. Since each highly active molecule in the dataset can be categorized by these characteristics, combining these structural elements may lead to the synthesis of even more effective SARS-CoV-2 main protease inhibitors in the future.
2.5. Generalizability and Impact
The findings of this study, particularly the success of 2D descriptors and OLS regression, have broad implications for data-driven decision-making beyond QSAR. In any field dealing with high-dimensional data, the trade-off between model complexity and interpretability is a crucial consideration. Our work demonstrates that simpler, more interpretable models can not only perform comparably to complex ones, but can also be more valuable in practice by providing clear, actionable insights to decision-makers. By providing a robust and reproducible methodology for feature selection that empowers such models, this study offers a blueprint for building transparent, efficient, and trustworthy predictive systems in a variety of domains.
3. Materials and Methods
3.1. Datasets
Chemical structures exhibiting inhibitory activity against , used to train and test models, were sourced from the CHEMBL database (target ID = CHEMBL3927) [35]. To the best of our knowledge there is no explicit use of S9 metabolic activation in the original assays in the ChEMBL data. The half maximal inhibitory concentration ( ) values of all 86 molecules were converted to values using the equation following
and utilized as target values (factor 9 was taken since are given in nM). The list of molecules used, together with their activities, is presented in Supporting Information File S1. Given the dataset’s wide range of activities spanning more than 4 log units, including molecules with activity at the nano-molar level, it can be considered suitable for developing QSAR models aimed at facilitating the discovery of novel drugs. Additionally, 105 molecules were gathered from the literature [6,7,8,9,10,11,12,13,14,15,16] and used as an external dataset both have higher heterogeneity and test the applicability of the models to predict the values of novel compounds (Supporting Information File S2).
Molecular structures, initially represented as SMILES strings, were processed using RDKit version 2023.9.5 for SMILES parsing, salt removal (retaining the largest organic fragment), functional group normalization, and stereochemistry assignment. Duplicate entries were removed when multiple values were not reported for the same compound. Open Babel version 3.1.1 was employed for tautomer standardization, protonation state adjustment to physiological pH , and generation of three-dimensional structures. All molecules were visually inspected, and the results of tautomer standardization and protonation state assignment were verified using MarvinSketch version 24.3.2, which applies ChemAxon’s standardization engine. Conformational sampling was performed within Open Babel version 3.1.1 using the MMFF94 force field, with up to 50 conformations generated per molecule. Energy minimization was applied, and the lowest-energy conformer was selected for further analysis.
3.2. Molecular Descriptors
In this study, the Mordred 2D molecular descriptors [36] and two quantum-chemical descriptors from [16] were utilized. Hydrogens were treated as implicit, by default, in the calculation of Mordred 2D descriptors. The structure was optimized at (on Br and I atoms) level of theory, at which quantum-chemical descriptors, and , were also obtained. Vibrational frequency analysis was conducted to ensure the identification of minima on potential surfaces. DFT calculations were carried out using the polarizable continuum solvation model, with water as the solvent.
3.3. Selection of Descriptors and Modeling
A schematic representation of the model development process is provided in Figure 6. Descriptors containing missing values or non-numerical entries were omitted from the dataset in the data cleaning step. The data were then divided into training (80%) and test sets (20%) using the Kennard–Stone algorithm and normalized using the MinMaxScaler. The Kennard–Stone algorithm provides uniform and representative coverage of the descriptor space by selecting samples that are maximally spread across the chemical space, thereby reducing the risk of extrapolation within the applicability domain. It is deterministic and fully reproducible, ensuring the same split is obtained for a given dataset and distance metric. Moreover, it is particularly suitable for small to medium-sized datasets where random splitting may lead to unbalanced or poorly representative divisions. The MinMaxScaler was fitted exclusively on the training set. Obtained scaling parameters were subsequently applied to the test and external sets to prevent data leakage. Additionally, descriptors with a standard deviation less than 0.05 in the training set were excluded from the analysis. The final descriptor selection process was carried out using two different approaches.
In the initial step of the first approach, the set of descriptors was filtered to exclude any pairs exhibiting a pairwise correlation higher than 0.95. Specifically, descriptors with higher cumulative correlation coefficients were removed from consideration. Subsequently, a genetic algorithm (GA), as well as SequentialFeatureSelector, Recursive Feature Elimination (RFE), and SelectKBest algorithms (available in the sklearn module) were employed. In the case of GA, specific parameters were: population size = 100, mutation rate = 20, and number of generations = 500. SequentialFeatureSelector was used in both backward and forward directions, while SelectKBest was applied with two scoring functions, mutual_info_regression, and f_regression. The remaining parameters were kept at default values.
For the second approach, the FeatureWiz algorithm was applied, followed by stepwise feature selection following the approach applied in [21]. The FeatureWiz (version 0.5.7) configuration was as default except for corr_limit. The impact of different thresholds for acceptable pairwise correlations between descriptors was also explored. corr_limit values of 0.85 and 0.99 were set to assess their influence on the selection process.
Coefficients of determination in leave-one-out cross-validation, , calculated for OLS regression, were used as the fitness function in both approaches. The ratio between the training set size and the number of descriptors was maintained at a minimum of 5. In other words, modelling was conducted using up to 14 descriptors.
Various linear regression models were employed for modelling: OLS, Ridge, Lasso (Least Absolute Shrinkage and Selection Operator), LassoLars (Lasso fitted with Least Angle Regression), Bayesian Ridge, ARD (Automatic Relevance Determination), and Linear SVR (Linear Support Vector Regression). Hyperparameters alpha (in the case of Ridge, Lasso, and LassoLars), alpha1, alpha2, lambda1, and lambda2 (in the case of Bayesian Ridge and ARD), and C (in the case of Linear SVR) were tuned in 5-fold cross-validation. The remaining parameters were kept at default values.
3.4. Methods for Inspection of Chemical Space
To delineate the applicability domain (AD), the leverage approach was employed, utilizing the Williams plot, which juxtaposes standard residuals against leverage (h). The h was calculated using the “hat” matrix formula , incorporating the training set descriptor matrix (X) and descriptor vector ( ) for each molecule. A leverage threshold ( ) was then established based on the number of descriptors (p) and the size of the training set (n), identifying molecules with as X-outliers. The value of was used. Additionally, molecules surpassing a standardized residual critical value, typically set at 3, were deemed Y-outliers [37]. Furthermore, principal component analysis (PCA) was utilized to gain deeper insights into chemical space. Additionally, a plot of molecular weight (MW) against the Wildman-Crippen partition coefficient (SlogP) was employed to explore chemical space further.
3.5. Evaluation of Models
Model evaluation entailed a comprehensive examination, employing a diverse array of commonly utilized statistical metrics as outlined in the literature [23,38] and applied in [21], including root mean square error for the training set ( ), leave-one-out cross-validation ( ,) and test set ( ), adjusted coefficients of determination ( ), coefficients of determination for the training set ( ), leave-one-out ( ), and leave-many-out ( ) cross-validation, as well as three metrics derived from coefficients of determination , , and . These metrics quantify predictive performance on the test set using different normalizations of the prediction error relative to the training set mean , test set mean , or variance scaling , respectively. Moreover, concordance correlation coefficient in the training set ( ), leave-one-out validation ( ), and the test set ( ), F-value, mean absolute error ( ) in the test set, slopes of regression lines through the origin of predicted vs. experimental values (k), and experimental vs. predicted values ( ) were evaluated, along with , and , and . Acceptable models were considered those that meet the following criteria: (1) , (2) , (3) , (4) , (5) , (6) , (7) , (8) , and (9) [39].
To address multicollinearity and chance correlation, a three-fold selection process was implemented. Initially, the QUIK (Q Under Influence of K) rule was utilized [40]. Following literature recommendations, only models with a difference of over 0.05 between two K indices were retained [29]. Subsequently, the Variance Inflation Factor (VIF) was computed, and models with each descriptor having a VIF value below 5 were passed to final test [41]. Y-Scrambling was then employed 200 times to ensure the robustness of the selected models.
3.6. Selection of the Best Model
The optimal model selection process utilized Multi-Criteria Decision Making (MCDM), more precisely the weighted sum model, integrating all measures of goodness-of-fit, robustness, and predictivity from the previous subsection to rank models. Additionally, the number of descriptors was factored in, ensuring a balanced assessment of model performance and simplicity. This dual approach ensures comprehensive evaluation, capturing both overall model performance via MCDM and the model’s complexity based on descriptor count.
3.7. Software
The modelling process was conducted using Python (v3.10.12) with standard libraries including NumPy (v1.25.2), pandas (v1.5.3), sklearn (v1.2.2), and statsmodels (v0.14.1), while visualization was facilitated by seaborn module (v0.13.1). Quantum-chemical descriptors were computed using the Gaussian 16 package [42]. Molecular descriptors were generated using the Mordred package (v1.2.0) [36]. Feature selection was executed utilizing the Featurewiz (v0.5.7) [43], genetic_selection (v0.6.0) [44], and sklearn modules. Preparation of data was done by RDKit 2023.9.5 [45], OpenBabel 3.1.1 [46], and MarvinSketch 24.3.2 [47].
4. Conclusions
In this study, we presented a robust QSAR modeling approach for identifying SARS-CoV-2 main protease inhibitors, emphasizing the value of interpretability and efficiency for data-driven decision-making. Our methodology, which combines the FeatureWiz algorithm with stepwise feature selection, successfully identified a set of 2D molecular descriptors that yield highly predictive models. The superior performance of OLS regression further underscores the importance of simple, transparent models for providing actionable insights. By demonstrating that a simpler approach can outperform more complex methods, we have shown that interpretability and computational efficiency are not merely desirable features but can be key drivers of a model’s success. The models presented here are not just predictive tools for QSAR; they represent a validated framework for data-driven decision-making that can be generalized to other fields where transparency and efficiency are critical for informed action.
To further advance this work and broaden its impact, we propose three distinct avenues for future research. First, exploration of advanced feature engineering. The current work utilized readily available two-dimensional descriptors. Future research should investigate the impact of incorporating more sophisticated feature engineering techniques, such as graph-based representations or molecular fingerprints derived from deep learning autoencoders. This will assess whether incremental gains in predictive power can be achieved without sacrificing the interpretability essential for drug discovery. Second, generalizability across therapeutic targets. The validated framework should be applied to a wider range of therapeutic targets beyond the SARS-CoV-2 main protease. Future studies will focus on testing the methodology’s generalizability and robustness against structurally diverse protein families and therapeutic areas to establish it as a universal standard for fast, interpretable QSAR modeling. Finally, third avenue would be the integration with High-Throughput Screening (HTS) pipelines and is the key practical outcome of this work. To maximize practical impact, the final models should be integrated into an automated, high-throughput computational screening pipeline. This will involve developing a user-friendly software module that can rapidly process large virtual libraries, provide immediate activity predictions, and flag compounds with optimal predicted properties and low synthetic complexity, thereby directly streamlining hit identification for experimental validation.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1World Health Organization COVID-19 Dashboard 2025 Available online: https://data.who.int/dashboards/covid 19(accessed on 2 December 2025)
- 2Ojha P.K. Kar S. Krishna J.G. Roy K. Leszczynski J. Therapeutics for COVID-19: From computation to practices—Where we are, where we are heading to Mol. Divers.20212562565910.1007/s 11030-020-10134-x 32880078 PMC 7467145 · doi ↗ · pubmed ↗
- 3Tarighi P. Eftekhari S. Chizari M. Sabernavaei M. Jafari D. Mirzabeigi P. A review of potential suggested drugs for coronavirus disease (COVID-19) treatment Eur. J. Pharmacol.202189517389010.1016/j.ejphar.2021.17389033482181 PMC 7816644 · doi ↗ · pubmed ↗
- 4Canal B. Fujisawa R. Mc Clure A.W. Deegan T.D. Wu M. Ulferts R. Weissmann F. Drury L.S. Bertolin A.P. Zeng J. Identifying SARS-Co V-2 antiviral compounds by screening for small molecule inhibitors of nsp 15 endoribonuclease Biochem. J.20214782465247910.1042/BCJ 2021019934198324 PMC 8286823 · doi ↗ · pubmed ↗
- 5Jin Z. Du X. Xu Y. Deng Y. Liu M. Zhao Y. Zhang B. Li X. Zhang L. Peng C. Structure of Mpro from SARS-Co V-2 and discovery of its inhibitors Nature 202058228929310.1038/s 41586-020-2223-y 32272481 · doi ↗ · pubmed ↗
- 6Kumar V. Roy K. Development of a simple, interpretable and easily transferable QSAR model for quick screening antiviral databases in search of novel 3C-like protease (3C Lpro) enzyme inhibitors against SARS-Co V diseases SAR QSAR Environ. Res.20203151152610.1080/1062936 X.2020.177638832543892 · doi ↗ · pubmed ↗
- 7Khan P.M. Kumar V. Roy K. In Silico Modeling of Small Molecule Carboxamides as Inhibitors of SARS-Co V 3CL Protease: An Approach Towards Combating COVID-19Comb. Chem. High Throughput Screen.2021241281129910.2174/138620732366620091409471232928086 · doi ↗ · pubmed ↗
- 8Adhikari N. Banerjee S. Baidya S.K. Ghosh B. Jha T. Ligand-based quantitative structural assessments of SARS-Co V-2 3C Lpro inhibitors: An analysis in light of structure-based multi-molecular modeling evidences J. Mol. Struct.2022125113204110.1016/j.molstruc.2021.13204134866654 PMC 8627846 · doi ↗ · pubmed ↗
