URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection
Hung Le, Quang Pham, Doyen Sahoo, Steven C.H. Hoi

TL;DR
URLNet is a deep learning framework that automatically learns URL representations for malicious URL detection, overcoming limitations of manual feature engineering and capturing semantic patterns in URLs.
Contribution
It introduces an end-to-end CNN-based model that learns URL embeddings directly from characters and words, improving detection accuracy over traditional methods.
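To make "directly from characters and words" concrete, here is a minimal sketch of how a URL string might be turned into the two integer sequences that character-level and word-level embedding layers would consume. It is an illustration under stated assumptions, not the authors' preprocessing: the character set, the delimiter rule, the sequence lengths, and all function names (encode_chars, split_words, build_word_vocab, encode_words) are hypothetical.

```python
import re

# Hypothetical character vocabulary (an assumption, not taken from the paper).
# Id 0 is reserved for padding and id 1 for unknown symbols.
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(CHARS)}
PAD, UNK = 0, 1


def pad(ids, length):
    """Truncate or right-pad a list of ids to a fixed length."""
    return ids[:length] + [PAD] * max(0, length - len(ids))


def encode_chars(url, max_len=200):
    """Map each character of the URL to an integer id."""
    return pad([CHAR_TO_ID.get(ch, UNK) for ch in url.lower()], max_len)


def split_words(url):
    """Split a URL into word tokens at non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url) if tok]


def build_word_vocab(urls):
    """Assign an id (starting at 2) to every word seen in the training URLs."""
    vocab = {}
    for url in urls:
        for word in split_words(url.lower()):
            vocab.setdefault(word, len(vocab) + 2)
    return vocab


def encode_words(url, vocab, max_len=20):
    """Map each word of the URL to its id, or to UNK if it was never seen."""
    return pad([vocab.get(w, UNK) for w in split_words(url.lower())], max_len)
```

In a sketch like this, each id sequence would be passed to an embedding lookup and then to convolutional layers. A word missing from the training vocabulary collapses to the unknown id at the word level, while its characters are still individually represented at the character level.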
Findings
Significant performance improvement over existing methods
Effective capture of semantic information in URLs
Robustness to unseen URL features
Abstract
Malicious URLs host unsolicited content and are used to perpetrate cybercrimes. It is imperative to detect them in a timely manner. Traditionally, this is done through the usage of blacklists, which cannot be exhaustive, and cannot detect newly generated malicious URLs. To address this, recent years have witnessed several efforts to perform Malicious URL Detection using Machine Learning. The most popular and scalable approaches use lexical properties of the URL string by extracting Bag-of-words like features, followed by applying machine learning models such as SVMs. There are also other features designed by experts to improve the prediction performance of the model. These approaches suffer from several limitations: (i) Inability to effectively capture semantic meaning and sequential patterns in URL strings; (ii) Requiring substantial manual feature engineering; and (iii) Inability to…
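The bag-of-words style lexical features mentioned in the abstract can be sketched as follows. This is a simplified illustration, assuming a tokenizer that splits on non-alphanumeric characters and a vocabulary built from training URLs; the function names are hypothetical and the downstream classifier (for example an SVM) is left out.

```python
import re
from collections import Counter


def tokenize(url):
    """Lowercase the URL and split it at non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]


def fit_vocabulary(urls):
    """Assign a feature index to every token seen in the training URLs."""
    vocab = {}
    for url in urls:
        for tok in tokenize(url):
            vocab.setdefault(tok, len(vocab))
    return vocab


def bag_of_words(url, vocab):
    """Count known tokens; tokens outside the vocabulary are dropped."""
    vec = [0] * len(vocab)
    for tok, n in Counter(tokenize(url)).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec
```

Two properties of this representation are visible in the sketch: token order is discarded, so "http://bank.com/login" and "login://com.bank/http" map to the same vector, and tokens that never appeared in training contribute no feature at all.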
Taxonomy
Topics: Spam and Phishing Detection · Misinformation and Its Impacts · Advanced Malware Detection Techniques
