Continuous Multi-Task Pre-training for Malicious URL Detection and Webpage Classification

Yujie Li; Yiwei Liu; Peiyue Li; Yifan Jia; Yanbin Wang

arXiv:2402.11495·cs.CR·May 27, 2025·2 cites

TL;DR

This paper introduces urlBERT, a specialized Transformer-based pre-trained model for URLs, designed to improve malicious URL detection and webpage classification through multi-task learning and novel training strategies.

Contribution

The paper presents urlBERT, a domain-specific pre-trained URL encoder with multi-level pretraining tasks and a grouped sequential learning method for effective multi-task fine-tuning.

Findings

01. urlBERT outperforms standard models in downstream tasks.

02. Multi-task mode effectively handles multiple URL-related tasks.

03. Proposed training strategies improve stability and performance.

Abstract

Malicious URL detection and webpage classification are critical tasks in cybersecurity and information management. In recent years, extensive research has explored using BERT or similar language models to replace traditional machine learning methods for detecting malicious URLs and classifying webpages. While previous studies show promising results, they often apply existing language models to these tasks without accounting for the inherent differences in domain data (e.g., URLs being loosely structured and semantically sparse compared to text), leaving room for performance improvement. Furthermore, current approaches focus on single tasks and have not been tested in multi-task scenarios. To address these challenges, we propose urlBERT, a pre-trained URL encoder leveraging Transformer to encode foundational knowledge from billions of unlabeled URLs. To achieve it, we propose to use 5…

Code & Models

Repositories

davidup1/urlbert (PyTorch, official)

Taxonomy

Topics: Spam and Phishing Detection · Misinformation and Its Impacts

Methods: Contrastive Learning