A New Dataset and Methodology for Malicious URL Classification

Ilan Schvartzman; Roei Sarussi; Maor Ashkenazi; Ido Kringel; Yaniv Tocker; Tal Furman Shohet

arXiv:2501.00356·cs.LG·January 3, 2025

TL;DR

This paper introduces DeepURLBench, a comprehensive multi-class dataset for malicious URL classification, and enhances URLNet with DNS features to improve accuracy and real-time performance in cybersecurity.

Contribution

The paper presents a new multi-class dataset for malicious URLs and improves URLNet with DNS features, advancing real-time classification capabilities.

Findings

01. DeepURLBench outperforms existing datasets in quality and structure.

02. Enhanced URLNet with DNS features shows significant accuracy improvements.

03. Model maintains real-time efficiency with the proposed enhancements.

Abstract

Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing, and malicious URLs. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared to a standard binary classification approach. Additionally, we propose improvements to string-based…
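To make the multi-class setup concrete, the following is a minimal sketch of how the three classes named in the abstract might be represented, how they could be collapsed into a binary label for comparison, and how a URL string could be turned into a fixed-length sequence of character ids for a string-based model. All names here (`CLASSES`, `to_binary`, `encode_url`, `max_len`) are hypothetical, and the character encoding is a generic scheme chosen for illustration, not the preprocessing used by URLNet or by the authors.

```python
from __future__ import annotations

# Hypothetical label set: the three classes named in the abstract.
CLASSES = ("benign", "phishing", "malicious")
CLASS_TO_ID = {name: i for i, name in enumerate(CLASSES)}


def to_binary(label: str) -> int:
    """Collapse a three-class label into benign (0) vs. not benign (1)."""
    if label not in CLASS_TO_ID:
        raise ValueError(f"unknown label: {label!r}")
    return 0 if label == "benign" else 1


def encode_url(url: str, max_len: int = 200) -> list[int]:
    """Map a URL to a fixed-length list of character ids.

    Assumed scheme (illustrative only): id 0 is padding, id 1 is any
    character outside printable ASCII, and printable ASCII characters
    (codes 32..126) map to ids 2..96.
    """
    ids = []
    for ch in url[:max_len]:  # truncate long URLs
        code = ord(ch)
        ids.append(code - 32 + 2 if 32 <= code < 127 else 1)
    ids.extend([0] * (max_len - len(ids)))  # right-pad short URLs
    return ids


if __name__ == "__main__":
    print(CLASS_TO_ID["phishing"], to_binary("phishing"))
    print(encode_url("a.b", max_len=5))
```

A multi-class model would train on the ids in `CLASS_TO_ID`, while a binary baseline would train on the output of `to_binary`; both could consume the same encoded URL sequences.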

Code & Models

Repositories

deepinstinct-algo/DeepURLBench (official)

Datasets

davanstrien/DeepURLBench (dataset, 155 downloads)

Taxonomy

Topics: Spam and Phishing Detection

Methods: Umbrella Reinforcement Learning