Detection of Malicious Websites Using Machine Learning Techniques

Adebayo Oshingbesan; Courage Ekoh; Chukwuemeka Okobi; Aime Munezero; Kagame Richard

arXiv:2209.09630·cs.CR·September 21, 2022

Open Access

TL;DR

This paper evaluates ten machine learning models for detecting malicious websites using lexical features, highlighting K-Nearest Neighbor's consistent performance across datasets and the challenges in feature generalization.

Contribution

It provides a comprehensive cross-dataset analysis of multiple ML models for malicious website detection, emphasizing model robustness and feature limitations.

Findings

01. K-Nearest Neighbor performs consistently well across datasets.

02. Other models, such as Random Forest and Support Vector Machines, outperform a baseline model.

03. Lexical feature subsets do not generalize well across models or datasets.

Abstract

In detecting malicious websites, a common approach is the use of blacklists, which are not exhaustive and cannot generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce the vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and to understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different datasets and then carried out a cross-dataset analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently well across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model of predicting every link as…
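To make the idea of lexical features and cross-dataset evaluation concrete, here is a minimal Python sketch. It is not the authors' pipeline: the six features, the helper names (`lexical_features`, `knn_predict`, `accuracy`), the value `k=3`, and both toy datasets are assumptions made up for illustration. It computes a few string-level features from each URL, fits a plain K-Nearest Neighbor vote on one dataset, and scores it on a different one.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def lexical_features(url):
    """Map a URL string to a small numeric vector (hypothetical feature set)."""
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return [
        float(len(url)),                          # total URL length
        float(len(host)),                         # hostname length
        float(url.count(".")),                    # number of dots
        float(url.count("-")),                    # number of hyphens
        float(sum(ch.isdigit() for ch in url)),   # number of digits
        float(len([s for s in parsed.path.split("/") if s])),  # path depth
    ]


def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    by_distance = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fit on one labeled URL list, score on another (cross-dataset)."""
    train_X = [lexical_features(u) for u, _ in train]
    train_y = [y for _, y in train]
    hits = sum(
        knn_predict(train_X, train_y, lexical_features(u), k) == y
        for u, y in test
    )
    return hits / len(test)


# Made-up toy data, NOT from the paper. Label 1 = malicious, 0 = benign.
dataset_a = [
    ("http://example.com/index.html", 0),
    ("http://example.org/docs/intro", 0),
    ("http://example.net/about", 0),
    ("http://login-verify-account-0931.example.net/secure/update/confirm.php", 1),
    ("http://192.0.2.15/free-prize-4412/claim-now/win.exe", 1),
    ("http://secure-update-7781.example.org/a/b/c/d/login.php", 1),
]
dataset_b = [
    ("http://example.com/contact", 0),
    ("http://account-reset-2210-confirm.example.com/verify/step/1/index.php", 1),
]

if __name__ == "__main__":
    print("train A, test B:", accuracy(dataset_a, dataset_b))
    print("train B, test A:", accuracy(dataset_b, dataset_a, k=1))
```

Scoring in both directions is the point of the exercise: a model that looks strong when trained and tested on the same source can degrade when the test URLs come from a different collection, which is the generalization question the abstract raises.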

Taxonomy

Topics: Spam and Phishing Detection · Network Security and Intrusion Detection · Misinformation and Its Impacts

Methods: Logistic Regression