Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification
Yujie Li, Yiwei Liu, Peiyue Li, Yifan Jia, Yanbin Wang

TL;DR
This paper introduces urlBERT, a specialized Transformer-based pre-trained model for URLs, designed to improve malicious URL detection and webpage classification through multi-task learning and novel training strategies.
Contribution
The paper presents urlBERT, a domain-specific pre-trained URL encoder with multi-level pre-training tasks and a grouped sequential learning method for effective multi-task fine-tuning.
Findings
urlBERT outperforms standard models in downstream tasks.
Multi-task mode effectively handles multiple URL-related tasks.
Proposed training strategies improve stability and performance.
Abstract
Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios. To address these challenges, we propose urlBERT, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve it, we propose to use 5…
Taxonomy
Topics: Spam and Phishing Detection · Misinformation and Its Impacts
Methods: Contrastive Learning
