DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification
Abdelkader El Mahdaouy, Salima Lamsiyah, Meryem Janati Idrissi, Hamza Alami, Zakaria Yartaoui, Ismail Berrada

TL;DR
This paper introduces DomURLs_BERT, a pre-trained BERT-based model designed to detect and classify malicious domains and URLs, outperforming existing models across various cybersecurity tasks.
Contribution
The paper presents a novel pre-trained BERT-based encoder specifically adapted for malicious domain and URL detection, trained on a large multilingual corpus, and demonstrates its superior performance over existing models.
Findings
Outperforms state-of-the-art character-based deep learning models.
Effective across multiple classification tasks including phishing and malware.
Pre-trained model and datasets are publicly available.
Abstract
Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA)…
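To make the MLM objective mentioned in the abstract concrete, here is a minimal sketch of BERT-style input corruption applied to a URL. It is an illustration under stated assumptions, not the authors' pipeline: the regex tokenizer, the toy vocabulary, the example URL, and the function names `naive_tokenize` and `mask_tokens` are all hypothetical, and the 15% selection rate with the 80/10/10 replacement split follows the original BERT recipe, which the abstract does not confirm for DomURLs_BERT.

```python
import random
import re

MASK = "[MASK]"
# Toy vocabulary used only for random replacements (an assumption).
TOY_VOCAB = ["com", "net", "login", "secure", "www", ".", "/", "-"]


def naive_tokenize(url):
    """Split a URL into alphanumeric runs and single punctuation marks.

    A stand-in for a learned subword tokenizer, not the paper's tokenizer.
    """
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) following the original BERT MLM recipe.

    labels[i] holds the original token at selected positions and None
    elsewhere, so a loss would only be computed on selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)  # 80%: replace with the mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(TOY_VOCAB))  # 10%: random token
            else:
                corrupted.append(tok)  # 10%: keep the token unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    tokens = naive_tokenize("secure-login.example.com/verify")
    corrupted, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
    print(tokens)
    print(corrupted)
    print(labels)
```

During pre-training, an encoder would be trained to recover the original token at every position whose label is not `None`; fine-tuning for detection or classification would then add a task-specific head on top of that encoder.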
Taxonomy
Methods: Softmax · Layer Normalization · Attention Is All You Need · WordPiece · Dropout · Attention Dropout · Dense Connections · Residual Connection · Linear Layer
