A Big Data Architecture for Early Identification and Categorization of Dark Web Sites
Javier Pastor-Galindo, Hông-Ân Sandlin, Félix Gómez Mármol, Gérôme Bovet, Gregorio Martínez Pérez

TL;DR
This paper presents a scalable big data architecture for early detection, analysis, and categorization of dark web sites, using open source tools to monitor and analyze Tor sites efficiently.
Contribution
It introduces an end-to-end scalable system leveraging open source big data tools for automated discovery, content deduplication, and topic categorization of dark web sites.
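As a rough illustration of the automated discovery step, the sketch below pulls candidate onion addresses out of arbitrary text (for example a threat-intelligence feed or a code repository file). It assumes only v3 onion addresses, which are 56 base32 characters followed by `.onion`; the pattern and the function name `extract_onions` are illustrative and not taken from the paper.

```python
import re

# Assumption: only v3 onion addresses are of interest (56 base32 chars + ".onion").
# The lookbehind rejects matches that start in the middle of a longer base32 run.
ONION_V3 = re.compile(r"(?<![a-z2-7])([a-z2-7]{56})\.onion\b", re.IGNORECASE)


def extract_onions(text):
    """Return the sorted, de-duplicated, lowercased v3 onion hosts found in text."""
    return sorted({m.group(1).lower() + ".onion" for m in ONION_V3.finditer(text)})


if __name__ == "__main__":
    addr = "a" * 55 + "d"  # hypothetical address, syntactically valid only
    sample = f"seen at http://{addr}.onion/index and HTTP://{addr.upper()}.ONION"
    print(extract_onions(sample))
```

Lowercasing before de-duplication means the same service reported by several sources with different casing is counted once.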
Findings
Identified 80,049 onion sites in 93 days.
Characterized 90% of discovered sites.
Detected extensive content duplication and phishing networks.
Abstract
The dark web has become notorious for its association with illicit activities and there is a growing need for systems to automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built using an Open Source Big Data stack for data serving with Kubernetes, Kafka, Kubeflow, and MinIO, continuously discovering onion addresses in different sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloading the HTML from Tor and deduplicating the content using MinHash LSH, and categorizing with the BERTopic modeling (SBERT embedding, UMAP dimensionality reduction, HDBSCAN document clustering and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the…
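The MinHash LSH deduplication mentioned in the abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shingle size, signature length (`NUM_PERM`), band count (`BANDS`), similarity threshold, and all function names are hypothetical choices, and a production system would more likely rely on a dedicated library.

```python
import hashlib
import re
from itertools import combinations

NUM_PERM = 64  # assumed signature length
BANDS = 16     # assumed number of LSH bands (rows per band = NUM_PERM // BANDS)


def shingles(text, k=5):
    """Set of k-character shingles of the lowercased, whitespace-normalized text."""
    text = re.sub(r"\s+", " ", text.lower()).strip()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    """Seeded 64-bit hash, standing in for one random permutation."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(signatures):
    """LSH banding: documents sharing any identical band become candidate pairs."""
    rows = NUM_PERM // BANDS
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def near_duplicates(pages, threshold=0.8):
    """Return candidate pairs whose estimated similarity reaches the threshold."""
    sigs = {doc_id: minhash(html) for doc_id, html in pages.items()}
    return {
        (a, b)
        for a, b in candidate_pairs(sigs)
        if estimated_jaccard(sigs[a], sigs[b]) >= threshold
    }
```

Banding keeps the comparison cost down: only documents that collide in at least one band are compared, so the full pairwise check over all downloaded pages is avoided. Mirrored or cloned sites, such as the phishing networks noted in the findings, would show up as pairs with an estimated similarity close to 1.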
Taxonomy
Topics: Advanced Malware Detection Techniques · Cybercrime and Law Enforcement Studies · Spam and Phishing Detection
