A feature-engineered dataset of benign and phishing URLs for machine learning and large language models evaluation

Dam Minh Linh; Tran Cong Hung

PMC · DOI: 10.1016/j.dib.2025.112162 · October 10, 2025

Open Access

TL;DR

This paper introduces a feature-rich dataset of 111,660 URLs labeled as benign or phishing, enabling better evaluation of machine learning and large language models for cybersecurity.

Contribution

The paper provides a curated, feature-engineered dataset for phishing detection with reproducible benchmarks for ML and LLM models.

Findings

01

The dataset includes 22 numerical features and 26 total columns for URL-based phishing detection.

02

Baseline models achieved over 96% accuracy and ROC AUC scores above 0.99.

03

The dataset supports reproducible benchmarks and future research on adversarial robustness.
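The two headline metrics can be read against the class split reported for the dataset (100,000 benign, 11,660 phishing). The following is a minimal sketch in plain Python, not the authors' evaluation code: the helper names `accuracy` and `roc_auc` are hypothetical, and the quadratic pairwise AUC is chosen for clarity rather than speed.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random phishing URL (label 1)
    receives a higher score than a random benign URL (label 0).
    Ties count as half a win. O(n_pos * n_neg), for illustration only."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts as reported for the dataset.
N_BENIGN, N_PHISHING = 100_000, 11_660

# Accuracy of a trivial model that always predicts "benign" (label 0).
majority_baseline = N_BENIGN / (N_BENIGN + N_PHISHING)
print(f"majority-class accuracy: {majority_baseline:.4f}")
```

Because roughly 89.6% of the samples are benign, a model that always predicts label 0 already reaches that accuracy, so ROC AUC is the more informative of the two reported figures. For the full dataset, a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.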

Abstract

Phishing websites remain a major cybersecurity threat, yet the availability of balanced and feature-rich datasets for evaluating detection models is still limited. While machine learning (ML) and large language models (LLMs) have shown strong potential in URL-based classification, most public datasets provide raw URLs without feature engineering, making reproducibility and fair comparison across models difficult. To address this gap, we present a curated dataset of 111,660 URLs, consisting of 100,000 benign samples (label 0) and 11,660 phishing samples (label 1). Each URL entry is enriched with 22 numerical lexical and structural features (e.g., URL length, domain length, digit ratio, entropy, HTTPS usage). Three string reference columns (URL, domain, TLD) are preserved for interpretability, and one label column (0 = benign, 1 = phishing) is included, for a total of 26 columns. To…
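A few of the lexical features named in the abstract can be sketched with the Python standard library. This is an illustration under stated assumptions, not the dataset's actual extraction code: the function names and dictionary keys are hypothetical, and the abstract does not say whether entropy and digit ratio are computed over the full URL or only the domain (the full URL is assumed here).

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Compute five illustrative lexical features for one URL.

    Key names are hypothetical; they need not match the dataset's columns.
    """
    parts = urlsplit(url)
    # hostname is None when the URL has no scheme (e.g. "example.com/path").
    domain = parts.hostname or ""
    digits = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "domain_length": len(domain),
        "digit_ratio": digits / len(url) if url else 0.0,
        "entropy": shannon_entropy(url),  # assumption: computed over the full URL
        "uses_https": int(parts.scheme == "https"),
    }


features = extract_features("https://example.com/a1")
print(features)
```

Note that `urlsplit` only recognizes a host when the URL carries a scheme or a leading `//`, so raw entries without a scheme would need normalizing before the domain-based features are meaningful.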

Taxonomy

Topics: Spam and Phishing Detection · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies