Towards Benchmark Datasets for Machine Learning Based Website Phishing Detection: An experimental study

Abdelhakim Hannousse; Salima Yahiouche

arXiv:2010.12847 · cs.CR · April 24, 2024



TL;DR

This study develops a reproducible dataset scheme for website phishing detection, evaluates various features and classifiers, and finds hybrid features with filter-based selection yield the best accuracy of 96.83%.

Contribution

It introduces a systematic scheme for building extensible phishing datasets and evaluates feature and classifier combinations for improved detection performance.

Findings

01. Random Forest is the most predictive classifier.

02. External service features are most discriminative.

03. Hybrid features achieve 96.61% accuracy.

Abstract

In this paper, we present a general scheme for building reproducible and extensible datasets for website phishing detection. The aim is to (1) enable comparison of systems that use different features, (2) overcome the short-lived nature of phishing websites, and (3) keep track of the evolution of phishing tactics. To experiment with the proposed scheme, we start by adopting a refined classification of website phishing features. We systematically select a total of 87 commonly recognized features, classify them, and subject them to relevance and runtime analysis. We use the collected set of features to build a dataset in light of the proposed scheme. Thereafter, we use a conceptual replication approach to check the genericity of former findings for the built dataset. Specifically, we evaluate the performance of classifiers on individual classes and on combinations of classes, we…
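The evaluation over individual feature classes and combinations of classes can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the class names (`url`, `content`, `external`), the column indices, the synthetic data, and the nearest-centroid classifier are all hypothetical stand-ins chosen so the example runs with the Python standard library alone.

```python
import itertools
import random

# Hypothetical grouping of feature columns into classes (illustrative only).
FEATURE_CLASSES = {
    "url": [0, 1],
    "content": [2, 3],
    "external": [4, 5],
}

def make_synthetic(n, seed=0):
    """Generate toy rows; label 1 = phishing, 0 = legitimate (not real data)."""
    rng = random.Random(seed)
    shifts = [0.5, 0.5, 0.2, 0.2, 1.5, 1.5]  # assumed per-column separation
    rows, labels = [], []
    for i in range(n):
        y = i % 2
        rows.append([rng.gauss(y * s, 1.0) for s in shifts])
        labels.append(y)
    return rows, labels

def centroid(rows, columns):
    """Mean of each selected column over the given rows."""
    return [sum(r[c] for r in rows) / len(rows) for c in columns]

def accuracy_for(columns, X_train, y_train, X_test, y_test):
    """Nearest-centroid accuracy using only the selected columns."""
    cents = {}
    for label in (0, 1):
        members = [x for x, y in zip(X_train, y_train) if y == label]
        cents[label] = centroid(members, columns)
    correct = 0
    for x, y in zip(X_test, y_test):
        dists = {
            label: sum((x[c] - m) ** 2 for c, m in zip(columns, cent))
            for label, cent in cents.items()
        }
        correct += min(dists, key=dists.get) == y
    return correct / len(y_test)

def evaluate_class_combinations(X_train, y_train, X_test, y_test):
    """Score every non-empty combination of feature classes."""
    results = {}
    names = sorted(FEATURE_CLASSES)
    for k in range(1, len(names) + 1):
        for combo in itertools.combinations(names, k):
            columns = [c for name in combo for c in FEATURE_CLASSES[name]]
            results[combo] = accuracy_for(columns, X_train, y_train, X_test, y_test)
    return results

X, y = make_synthetic(400)
results = evaluate_class_combinations(X[:300], y[:300], X[300:], y[300:])
for combo, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print("+".join(combo), round(acc, 3))
```

With three classes the loop scores seven subsets: three single classes, three pairs, and the full hybrid set. Swapping in a real dataset and a stronger classifier would only require replacing `make_synthetic` and `accuracy_for`.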


Code & Models

Datasets

pirocheto/phishing-url
dataset· 330 dl
