Breaking Bad: Detecting malicious domains using word segmentation
Wei Wang, Kenneth Shirley

TL;DR
This paper presents a lexical analysis method using word segmentation to detect malicious domains in cellular networks, improving detection accuracy and interpretability with a lightweight, near-real-time approach.
Contribution
It introduces a novel use of word segmentation on domain names to enhance malicious domain detection, providing interpretable results and complementing existing methods.
Findings
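Word segmentation improves detection relative to approaches without segmentation, as measured by misclassification rates and areas under the ROC curve.
The results are interpretable and reveal which words are common in malicious domains, with little supervision or tuning required.
The method is lightweight enough for near-real-time deployment.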
Abstract
In recent years, vulnerable hosts and maliciously registered domains have been frequently involved in mobile attacks. In this paper, we explore the feasibility of detecting malicious domains visited on a cellular network based solely on lexical characteristics of the domain names. In addition to using traditional quantitative features of domain names, we also use a word segmentation algorithm to segment the domain names into individual words to greatly expand the size of the feature set. Experiments on a sample of real-world data from a large cellular network show that using word segmentation improves our ability to detect malicious domains relative to approaches without segmentation, as measured by misclassification rates and areas under the ROC curve. Furthermore, the results are interpretable, allowing one to discover (with little supervision or tuning required) which words are used…
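The segmentation step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which segmentation algorithm the authors use, so the code below swaps in a generic unigram dynamic-programming segmenter. The vocabulary `WORD_FREQ`, its counts, the unknown-token penalty, and the helper names `segment` and `domain_tokens` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy vocabulary with made-up counts (not taken from the paper).
WORD_FREQ = {
    "free": 50, "movie": 30, "movies": 20, "online": 40,
    "bank": 25, "login": 15, "secure": 20, "update": 18,
}
TOTAL = sum(WORD_FREQ.values())
MAX_WORD_LEN = max(len(w) for w in WORD_FREQ)


def word_cost(token):
    """Negative log-probability for known words; a heavy penalty otherwise."""
    if token in WORD_FREQ:
        return -math.log(WORD_FREQ[token] / TOTAL)
    # Assumed penalty: unknown tokens cost 20 per character.
    return 20.0 * len(token)


def segment(label):
    """Split one run of characters into the lowest-cost sequence of tokens."""
    n = len(label)
    best = [0.0] + [math.inf] * n   # best[i]: lowest cost of segmenting label[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last token in label[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            cost = best[start] + word_cost(label[start:end])
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    end = n
    while end > 0:
        start = back[end]
        tokens.append(label[start:end])
        end = start
    return tokens[::-1]


def domain_tokens(domain):
    """Turn a domain name into word tokens usable as classifier features."""
    # Naive assumption: the last dot-separated label is the TLD and is dropped.
    labels = domain.lower().split(".")[:-1]
    tokens = []
    for label in labels:
        for part in label.split("-"):
            if part:
                tokens.extend(segment(part))
    return tokens


print(domain_tokens("freemoviesonline.com"))   # ['free', 'movies', 'online']
print(domain_tokens("secure-banklogin.com"))   # ['secure', 'bank', 'login']
```

The resulting tokens could then be encoded as bag-of-words features alongside the traditional quantitative features the abstract mentions. A realistic setup would need a much larger vocabulary and proper handling of multi-part public suffixes such as `co.uk`, which this sketch ignores.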
Taxonomy
Topics: Spam and Phishing Detection · Network Security and Intrusion Detection · Advanced Malware Detection Techniques
