# A Data-Driven Approach for Interpretable and Efficient Predictive Modeling: A Case Study in SARS-CoV-2 Protease Inhibitor Discovery Through Feature Selection

**Authors:** Branislav Stanković, Sang-Yong Oh, Dušan Ramljak

PMC · DOI: 10.3390/ph19030498 · Pharmaceuticals · 2026-03-18

## TL;DR

This paper presents a robust, interpretable and computationally efficient feature selection methodology for predictive modeling, applied to the discovery of SARS-CoV-2 main protease inhibitors.

## Contribution

The contribution is an externally validated framework that combines the FeatureWiz algorithm with stepwise feature selection to build interpretable and computationally efficient predictive models for drug discovery.

## Key findings

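- Among the descriptor selection procedures evaluated, only the combination of FeatureWiz with stepwise feature selection satisfied all evaluation criteria required by state-of-the-art chemoinformatic models.
- Models based on two-dimensional descriptors and Ordinary Least Squares (OLS) regression achieved the best results.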
- The framework provides transparent and computationally efficient models for biological activity prediction.

## Abstract

Background/Objectives: Feature selection approaches should satisfy all evaluation criteria required by state-of-the-art chemoinformatic models. Our aim is to develop a methodology that is robust, interpretable and computationally efficient.

Methods: This study presents a robust methodology for developing highly interpretable and computationally efficient predictive models, with a specific application in the discovery of SARS-CoV-2 main protease inhibitors. We evaluated various descriptor selection procedures to identify a transparent and reproducible approach that provides actionable insights for data-driven decisions. The models were trained and tested using molecules from the ChEMBL database and further validated on an external set of compounds.

Results: Our findings demonstrate that a recently proposed procedure, combining the FeatureWiz algorithm with stepwise feature selection, is the only approach that satisfies all evaluation criteria required by state-of-the-art chemoinformatic models. In particular, we found that models based on two-dimensional descriptors and Ordinary Least Squares regression achieved the best results.

Conclusions: Our framework and the choices made offer significant advantages in a decision-making context due to their inherent interpretability and computational efficiency. Our derived models, benchmarked against those in the literature, serve as effective, transparent tools for the rapid and reliable prediction of biological activity, providing a validated framework for data-driven decisions in drug discovery and beyond.
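To illustrate the kind of procedure the abstract refers to, the following is a minimal sketch of forward stepwise feature selection wrapped around an Ordinary Least Squares fit. It is not the authors' implementation: the FeatureWiz pre-filtering step is not reproduced, and the selection criterion (adjusted R²), the `min_gain` stopping threshold, the function names and the synthetic data are all assumptions made for illustration. In practice the columns of `X` would be molecular descriptors and `y` a measured activity value.

```python
import numpy as np


def ols_fit(X, y):
    """Fit OLS with an intercept; return the coefficients and R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Penalize R^2 for the number of descriptors in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


def forward_stepwise(X, y, min_gain=1e-3):
    """Greedy forward selection (illustrative criterion, not the paper's).

    At each step, add the column that yields the highest adjusted R^2,
    and stop once the improvement falls below `min_gain`.
    """
    n, p = X.shape
    selected = []
    best_score = -np.inf
    while len(selected) < p:
        candidates = [j for j in range(p) if j not in selected]
        scores = []
        for j in candidates:
            cols = selected + [j]
            _, r2 = ols_fit(X[:, cols], y)
            scores.append(adjusted_r2(r2, n, len(cols)))
        k = int(np.argmax(scores))
        if scores[k] - best_score < min_gain:
            break
        selected.append(candidates[k])
        best_score = scores[k]
    return selected, best_score


# Synthetic stand-in for a descriptor matrix: only columns 1 and 4 carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.1 * rng.normal(size=200)

selected, score = forward_stepwise(X, y)
print("selected columns:", selected, "adjusted R^2:", round(score, 4))
```

On this synthetic data the procedure should keep only the two informative columns. The resulting model is a short linear equation over a handful of named descriptors, which is the sense in which such models are interpretable and cheap to fit.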

## Linked entities

- **Diseases:** SARS-CoV-2 (MONDO:0100096)

## Full-text entities

- **Species:** Severe acute respiratory syndrome coronavirus 2 (no rank) [taxon 2697049]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC13029608/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC13029608/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/PMC13029608/full.md

---
Source: https://tomesphere.com/paper/PMC13029608