HTMLPhish: Enabling Phishing Web Page Detection by Applying Deep Learning Techniques on HTML Analysis
Chidimma Opara, Bo Wei, and Yingke Chen

TL;DR
HTMLPhish is a deep learning approach that uses convolutional neural networks (CNNs) to detect phishing web pages directly from their HTML content. It reports over 93% accuracy and language-independent detection without extensive manual feature engineering.
Contribution
The paper presents HTMLPhish, a CNN-based phishing detector that learns feature representations from word and character embeddings of a page's HTML document and is reported to work independently of the language of the page's text.
Findings
Over 93% accuracy on a large dataset
Language-independent detection capability
Handling of previously unseen features by concatenating word and character embeddings
Abstract
Recently, the development and implementation of phishing attacks has come to require little technical skill and cost. This has led to an ever-growing number of phishing attacks on the World Wide Web. Consequently, proactive techniques to fight phishing attacks have become extremely necessary. In this paper, we propose HTMLPhish, a deep learning based, data-driven, end-to-end automatic phishing web page classification approach. Specifically, HTMLPhish receives the content of the HTML document of a web page and employs Convolutional Neural Networks (CNNs) to learn the semantic dependencies in the textual contents of the HTML. The CNNs learn appropriate feature representations from the HTML document embeddings without extensive manual feature engineering. Furthermore, our proposed approach of concatenating the word and character embeddings allows our model to manage new features and…
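The abstract describes feeding a concatenation of word-level and character-level embeddings of the HTML document to the CNNs. The following is a minimal sketch of that input representation only, not the authors' implementation. The tokenization regex, the sequence lengths, the embedding dimension, the random lookup tables (standing in for trainable embedding layers), the choice to concatenate along the sequence axis, and all function names are assumptions made for illustration.

```python
import random
import re

PAD, UNK = 0, 1  # reserved ids: padding and out-of-vocabulary tokens


def word_tokens(html):
    """Split HTML into alphanumeric runs and single punctuation marks (assumed scheme)."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", html)


def char_tokens(html):
    """Treat every character of the HTML string as one token."""
    return list(html)


def build_vocab(token_lists):
    """Assign an integer id to each distinct token seen in the training pages."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab, length):
    """Map tokens to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(t, UNK) for t in tokens[:length]]
    return ids + [PAD] * (length - len(ids))


def make_table(vocab_size, dim, seed):
    """Random lookup table standing in for a trainable embedding layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def combined_embedding(html, word_vocab, char_vocab, word_table, char_table,
                       word_len=8, char_len=16):
    """Stack word-level rows on top of character-level rows (assumed axis)."""
    word_ids = encode(word_tokens(html), word_vocab, word_len)
    char_ids = encode(char_tokens(html), char_vocab, char_len)
    return [word_table[i] for i in word_ids] + [char_table[i] for i in char_ids]


# Hypothetical two-page training set, used only to build the vocabularies.
train_pages = ['<form action="login.php">', '<a href="index.html">']
word_vocab = build_vocab(word_tokens(p) for p in train_pages)
char_vocab = build_vocab(char_tokens(p) for p in train_pages)

DIM = 4
word_table = make_table(len(word_vocab), DIM, seed=0)
char_table = make_table(len(char_vocab), DIM, seed=1)

# A (word_len + char_len) x DIM matrix that a 1D CNN could consume.
matrix = combined_embedding('<a href="mail.php">', word_vocab, char_vocab,
                            word_table, char_table)

# "mail" was never seen as a word, but all of its characters were.
unseen_word_ids = encode(word_tokens("mail"), word_vocab, 1)
unseen_char_ids = encode(char_tokens("mail"), char_vocab, 4)
```

In the usage lines, the word "mail" does not occur in `train_pages`, so it collapses to the unknown id at the word level, while each of its characters still maps to a known id. This is one plausible reading of how combining the two levels lets a model cope with tokens it has not seen before.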
