DomURLs_BERT: Pre-trained BERT-based Model for Malicious Domains and URLs Detection and Classification

Abdelkader El Mahdaouy; Salima Lamsiyah; Meryem Janati Idrissi; Hamza Alami; Zakaria Yartaoui; Ismail Berrada

arXiv:2409.09143 · cs.CR · February 6, 2026

TL;DR

This paper introduces DomURLs_BERT, a pre-trained BERT-based model designed to detect and classify malicious domains and URLs, outperforming existing models across various cybersecurity tasks.

Contribution

The paper presents a novel pre-trained BERT-based encoder specifically adapted for malicious domain and URL detection, trained on a large multilingual corpus, and demonstrates its superior performance over existing models.

Findings

01. Outperforms state-of-the-art character-based deep learning models.

02. Effective across multiple classification tasks including phishing and malware.

03. Pre-trained model and datasets are publicly available.

Abstract

Detecting and classifying suspicious or malicious domain names and URLs is a fundamental task in cybersecurity. To leverage such indicators of compromise, cybersecurity vendors and practitioners often maintain and update blacklists of known malicious domains and URLs. However, blacklists frequently fail to identify emerging and obfuscated threats. Over the past few decades, there has been significant interest in developing machine learning models that automatically detect malicious domains and URLs, addressing the limitations of blacklist maintenance and updates. In this paper, we introduce DomURLs_BERT, a pre-trained BERT-based encoder adapted for detecting and classifying suspicious/malicious domains and URLs. DomURLs_BERT is pre-trained using the Masked Language Modeling (MLM) objective on a large multilingual corpus of URLs, domain names, and Domain Generation Algorithms (DGA)…
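
The MLM objective mentioned in the abstract can be illustrated with a small sketch. It uses the masking rule from the original BERT recipe (select roughly 15% of tokens; of those, replace 80% with [MASK], 10% with a random token, and leave 10% unchanged). Whether DomURLs_BERT uses exactly these ratios is an assumption, not something this page states. The `toy_tokenize` helper is hypothetical and only splits on URL delimiters; it stands in for the model's real subword tokenizer.

```python
import random
import re

MASK = "[MASK]"


def toy_tokenize(url):
    """Hypothetical stand-in tokenizer: split a URL on common delimiters,
    keeping the delimiters as tokens. Not the model's real tokenizer."""
    return [t for t in re.split(r"([./:?=&-])", url) if t]


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style MLM corruption to a token list.

    Returns (masked, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else, so a
    training loss would only be computed where labels[i] is not None.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels


if __name__ == "__main__":
    tokens = toy_tokenize("http://example.com/login?id=1")
    masked, labels = mask_tokens(tokens, vocab=tokens, seed=0)
    print(masked)
    print(labels)
```

During pre-training, the encoder would receive the corrupted sequence and be trained to recover the original token at each selected position, which is how it can learn URL and domain structure without labeled examples.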

Code & Models

Repositories

AbdelkaderMH/DomURLs_BERT
PyTorch · Official

Models

🤗
amahdaouy/DomURLs_BERT
model · 1.6k dl · ♡ 2

Datasets

amahdaouy/Web_DomURLs
dataset · 206 dl

Taxonomy

Methods
Softmax · Layer Normalization · Attention Is All You Need · WordPiece · Dropout · Attention Dropout · Dense Connections · Residual Connection · Linear Layer