Detection of Malicious Websites Using Machine Learning Techniques
Adebayo Oshingbesan, Courage Ekoh, Chukwuemeka Okobi, Aime Munezero, Kagame Richard

TL;DR
This paper evaluates ten machine learning models for detecting malicious websites using lexical features, highlighting K-Nearest Neighbor's consistent performance across datasets and the challenges in feature generalization.
Contribution
It provides a comprehensive cross-dataset analysis of multiple ML models for malicious website detection, emphasizing model robustness and feature limitations.
Findings
K-Nearest Neighbor performs consistently well across datasets.
Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform the baseline model.
Lexical feature subsets do not generalize well across models or datasets.
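The cross-dataset comparison behind these findings can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the K-Nearest Neighbor classifier is a plain-Python stand-in, the helper names (`knn_predict`, `accuracy`, `cross_dataset_matrix`) are hypothetical, and the two tiny datasets are invented toy values (URL length, digit count) with labels 0 = benign and 1 = malicious.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training points (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test points labeled correctly by a KNN fitted on train."""
    train_X, train_y = train
    test_X, test_y = test
    hits = sum(knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y))
    return hits / len(test_y)


def cross_dataset_matrix(datasets, k=3):
    """Return {(train_name, test_name): accuracy} for every ordered pair."""
    return {
        (a, b): accuracy(datasets[a], datasets[b], k)
        for a in datasets
        for b in datasets
    }


# Toy data (assumed, not from the paper): (url_length, num_digits) per URL.
datasets = {
    "A": ([(20, 0), (22, 1), (25, 0), (80, 12), (90, 15), (85, 10)], [0, 0, 0, 1, 1, 1]),
    "B": ([(18, 1), (24, 0), (30, 2), (75, 9), (95, 14), (88, 11)], [0, 0, 0, 1, 1, 1]),
}

matrix = cross_dataset_matrix(datasets)
for (train_name, test_name), score in matrix.items():
    print(f"train={train_name} test={test_name} accuracy={score:.2f}")
```

The off-diagonal entries (train on one dataset, test on another) are the ones that speak to generalization. A model that scores well only on the diagonal has likely fitted dataset-specific patterns.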
Abstract
A common approach to detecting malicious websites is the use of blacklists, which are not exhaustive and cannot generalize to new malicious sites. Detecting newly encountered malicious websites automatically will help reduce vulnerability to this form of attack. In this study, we explored the use of ten machine learning models to classify malicious websites based on lexical features and to understand how they generalize across datasets. Specifically, we trained, validated, and tested these models on different datasets and then carried out a cross-dataset analysis. From our analysis, we found that K-Nearest Neighbor is the only model that performs consistently well across datasets. Other models such as Random Forest, Decision Trees, Logistic Regression, and Support Vector Machines also consistently outperform a baseline model of predicting every link as…
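To make "lexical features" concrete, the sketch below derives a few features from the URL string alone, without fetching the page. The feature set is an assumption chosen for illustration (lengths, character counts, an IPv4-host flag, character entropy); the abstract does not list the paper's features, and the function names `shannon_entropy` and `lexical_features` are hypothetical.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Compute illustrative (assumed) lexical features from the URL string."""
    # Prepend a scheme when missing so urlparse can identify the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    parts = host.split(".")
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_symbol": int("@" in url),
        "host_is_ipv4": int(len(parts) == 4 and all(p.isdigit() for p in parts)),
        "url_entropy": shannon_entropy(url),
    }


features = lexical_features("http://example.com/login")
print(features)
```

Each URL becomes a fixed-length numeric vector, which any of the classifiers named in the abstract could consume after scaling.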
Taxonomy
Topics: Spam and Phishing Detection · Network Security and Intrusion Detection · Misinformation and Its Impacts
Methods: Logistic Regression
