Towards Benchmark Datasets for Machine Learning Based Website Phishing Detection: An experimental study
Abdelhakim Hannousse, Salima Yahiouche

TL;DR
This study develops a reproducible dataset scheme for website phishing detection, evaluates various features and classifiers, and finds hybrid features with filter-based selection yield the best accuracy of 96.83%.
Contribution
It introduces a systematic scheme for building extensible phishing datasets and evaluates feature and classifier combinations for improved detection performance.
Findings
Random Forest is the most predictive classifier.
External service features are most discriminative.
Hybrid features achieve 96.61% accuracy; filter-based feature selection raises this to 96.83%.
Abstract
In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems that use different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined classification of website phishing features. We systematically select a total of 87 commonly recognized features, classify them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset following the proposed scheme. Thereafter, we use a conceptual replication approach to check whether former findings generalize to the built dataset. Specifically, we evaluate the performance of classifiers on individual classes and on combinations of classes, we…
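As an illustration of the filter-based selection mentioned in the TL;DR, the following is a minimal sketch, not the authors' implementation. It assumes absolute Pearson correlation with the label as the filter score, and it uses synthetic data with three hypothetical feature names (`domain_age_days`, `url_length`, `nb_hyperlinks`) that are not taken from the paper's feature list.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, labels, names):
    """Filter step: score every feature on its own, without training a classifier.

    The score (absolute Pearson correlation with the label) is an assumption
    made for this sketch; other filter scores would slot in the same way.
    """
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs(pearson(column, labels))
    ranking = sorted(names, key=lambda name: scores[name], reverse=True)
    return ranking, scores


def make_toy_data(n=400, seed=0):
    """Synthetic data with hypothetical features of decreasing usefulness."""
    rng = random.Random(seed)
    names = ["domain_age_days", "url_length", "nb_hyperlinks"]
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)  # 1 = phishing, 0 = legitimate
        domain_age = rng.gauss(200 if y else 3000, 150)  # strongly separated
        url_length = rng.gauss(80 if y else 50, 20)      # moderately separated
        nb_hyperlinks = rng.gauss(40, 15)                # independent of the label
        rows.append([domain_age, url_length, nb_hyperlinks])
        labels.append(y)
    return rows, labels, names


rows, labels, names = make_toy_data()
ranking, scores = rank_features(rows, labels, names)
top_k = ranking[:2]  # keep the two highest-scoring features
print(ranking)
print(top_k)
```

Because a filter scores each feature independently of the classifier, the ranking is cheap to compute and can be reused across models; a wrapper method would instead retrain a classifier for each candidate feature subset.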
