# Explainable phishing website detection for secure and sustainable cyber infrastructure

**Authors:** Tanzila Kehkashan, Maha Abdelhaq, Ahmad Sami Al-Shamayleh, Nazish Huda, Imran Ashraf Yaseen, Abdelmuttlib Ibrahim Abdalla Ahmed, Adnan Akhunzada

PMC · DOI: 10.1038/s41598-025-27984-w · 2025-11-25

## TL;DR

This paper proposes an explainable phishing detection system using machine learning and SHAP to improve accuracy and interpretability for secure cyber infrastructure.

## Contribution

The novelty lies in using SHAP-based feature selection with URL-based models for interpretable and accurate phishing detection.

## Key findings

- The random forest model achieved 97% accuracy in phishing detection.
- SHAP improved model interpretability by highlighting important URL-based features.
- The system relies primarily on URL features, avoiding the content and external-feature processing that the authors describe as too costly for resource-constrained devices.
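
To make the SHAP finding concrete, the following is a minimal sketch of the Shapley-value idea that SHAP builds on: each feature is credited with its average marginal contribution over all feature subsets. It does not use the `shap` library or the paper's trained models. The feature names and the `toy_score` function are hypothetical, chosen only for illustration.

```python
from itertools import combinations
from math import factorial

# Hypothetical binary URL features (illustrative names, not the dataset's columns).
FEATURES = ["has_ip_address", "has_at_symbol", "uses_https"]


def toy_score(active):
    """Toy phishing score for a set of active features (an assumption, not a trained model)."""
    score = 0.0
    if "has_ip_address" in active:
        score += 0.5
    if "has_at_symbol" in active:
        score += 0.25
    if "uses_https" in active:
        score -= 0.25
    # Interaction term: an IP-based host combined with '@' adds extra suspicion.
    if "has_ip_address" in active and "has_at_symbol" in active:
        score += 0.25
    return score


def shapley_values(features, value_fn):
    """Exact Shapley values: weighted average marginal contribution of each feature."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Weight of any coalition of size k that excludes f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi


if __name__ == "__main__":
    for name, value in shapley_values(FEATURES, toy_score).items():
        print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the score with all features active and the score with none, and the interaction term is split evenly between the two features involved. Exact enumeration grows exponentially with the number of features, which is why practical SHAP explainers rely on model-specific or sampling-based approximations.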

## Abstract

Phishing is a social engineering attack and a type of cybercrime that is constantly and dangerously on the rise. Phishing attacks can impact various sectors, including governmental, social, financial, and individual businesses. Traditional methods of identifying phishing websites, such as blacklist and heuristic approaches, often fail to provide sufficient protection. Moreover, traditional techniques that combine URLs, webpage content, and external features are time-consuming, require substantial computing power, and are unsuitable for devices with limited resources. Previous research has also often overlooked the critical role of identifying which features are important for detection and how they affect outcomes, so traditional methods might not fully capture the significance of individual features. To overcome this issue, this research applies feature selection techniques, specifically Shapley additive explanations (SHAP), with each model based primarily on the URL to improve the detection process. The "Phishing Website Detection" dataset from the Kaggle repository, containing over 11,000 URLs and 30 varied features, was used. The models, namely support vector machine (SVM), random forest (RF), decision tree (DT), logistic regression (LR), and K-nearest neighbor (KNN), were then trained and tested. Each model used SHAP to improve precision and interpretability by highlighting the most important features. The models were evaluated using key performance metrics: accuracy, precision, recall, and F1 score. Among the models tested, the random forest model achieved 97% accuracy. The proposed system offers a comprehensive and interpretable solution for phishing detection that contributes to a safer digital environment.
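
The four evaluation metrics named in the abstract can be computed from confusion-matrix counts. The sketch below is a plain-Python illustration under stated assumptions: the function name `classification_metrics` is hypothetical, and the labels are made-up toy data, not the paper's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 1 = phishing, 0 = legitimate (illustrative only).
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(classification_metrics(y_true, y_pred))
```

On this toy data there are 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, giving an accuracy of 0.8 and precision, recall, and F1 of 0.75 each. Reporting all four matters for phishing detection because accuracy alone can hide a poor recall on the phishing class when legitimate URLs dominate.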

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12647635/full.md

---
Source: https://tomesphere.com/paper/PMC12647635