PhishKey: A Novel Centroid-Based Approach for Enhanced Phishing Detection Using Adaptive HTML Component Extraction

Felipe Castaño; Eduardo Fidalgo; Enrique Alegre; Rocio Alaiz-Rodríguez; Raul Orduna; Francesco Zola

arXiv:2506.21106·cs.CR·June 27, 2025

TL;DR

PhishKey is a phishing detection system that pairs a character-level CNN for URL classification with centroid-based, word-level extraction of key HTML components. The two predictions are merged by soft voting to improve accuracy, robustness, and resistance to adversarial attacks.

Contribution

The paper introduces PhishKey, a hybrid approach integrating CNN-based URL classification with centroid-based HTML component extraction for enhanced phishing detection.

Findings

01. Achieves up to 98.70% F1 score across multiple datasets

02. Demonstrates strong resistance to adversarial injection attacks

03. Provides a robust, efficient detection method that combines URL and HTML features

Abstract

Phishing attacks pose a significant cybersecurity threat, evolving rapidly to bypass detection mechanisms and exploit human vulnerabilities. This paper introduces PhishKey to address the challenges of adaptability, robustness, and efficiency. PhishKey is a novel phishing detection method using automatic feature extraction from hybrid sources. PhishKey combines character-level processing with Convolutional Neural Networks (CNN) for URL classification, and a Centroid-Based Key Component Phishing Extractor (CAPE) for HTML content at the word level. CAPE reduces noise and ensures complete sample processing avoiding crop operations on the input data. The predictions from both modules are integrated using a soft-voting ensemble to achieve more accurate and reliable classifications. Experimental evaluations on four state-of-the-art datasets demonstrate the effectiveness of PhishKey. It…
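
The soft-voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each module outputs a probability vector over the classes (legitimate, phishing) and that the two vectors are combined by a weighted average; the function name `soft_vote`, the `url_weight` parameter, the equal default weighting, and the example probabilities are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def soft_vote(
    url_probs: Sequence[float],
    html_probs: Sequence[float],
    url_weight: float = 0.5,  # assumption: equal weighting by default
) -> Tuple[List[float], int]:
    """Combine two class-probability vectors by weighted averaging.

    Returns the averaged probabilities and the index of the winning class.
    """
    if len(url_probs) != len(html_probs):
        raise ValueError("both modules must score the same set of classes")
    if not 0.0 <= url_weight <= 1.0:
        raise ValueError("url_weight must lie in [0, 1]")

    html_weight = 1.0 - url_weight
    combined = [
        url_weight * u + html_weight * h
        for u, h in zip(url_probs, html_probs)
    ]
    # The predicted class is the one with the highest averaged probability.
    label = max(range(len(combined)), key=combined.__getitem__)
    return combined, label


# Hypothetical scores, class order: [legitimate, phishing].
url_module = [0.40, 0.60]   # URL model leans slightly towards phishing
html_module = [0.90, 0.10]  # HTML model is confident the page is legitimate

probs, label = soft_vote(url_module, html_module)
print(probs, label)
```

Unlike hard (majority) voting, averaging probabilities lets a confident module outweigh an uncertain one, which is why the confident HTML score decides the outcome in this hypothetical case.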

Taxonomy

Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Web Data Mining and Analysis

Methods: Umbrella Reinforcement Learning