A Big Data Architecture for Early Identification and Categorization of Dark Web Sites

Javier Pastor-Galindo; Hông-Ân Sandlin; Félix Gómez Mármol; Gérôme Bovet; Gregorio Martínez Pérez

arXiv:2401.13320 · cs.DC · January 25, 2024 · 1 cites

TL;DR

This paper presents a scalable big data architecture for early detection, analysis, and categorization of dark web sites, using open source tools to monitor and analyze Tor sites efficiently.

Contribution

It introduces an end-to-end scalable system leveraging open source big data tools for automated discovery, content deduplication, and topic categorization of dark web sites.
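The discovery step can be illustrated with a minimal sketch that pulls onion hostnames out of arbitrary text, such as a threat intelligence feed or a code repository file. This is not the authors' crawler: the function name `extract_onion_addresses` and the pattern `ONION_V3` are hypothetical, and the sketch assumes only version 3 onion addresses (56 base32 characters followed by `.onion`). It checks the shape of the hostname only and does not validate the embedded checksum.

```python
import re

# Assumption: only v3 onion hostnames are of interest (56 base32 chars + ".onion").
# The lookbehind stops the pattern from matching the tail of a longer base32 run.
ONION_V3 = re.compile(r"(?<![a-z2-7])([a-z2-7]{56})\.onion\b", re.IGNORECASE)


def extract_onion_addresses(text):
    """Return unique v3 onion hostnames in order of first appearance, lowercased."""
    seen = {}  # dict preserves insertion order, so it doubles as an ordered set
    for match in ONION_V3.finditer(text):
        host = match.group(1).lower() + ".onion"
        seen.setdefault(host, None)
    return list(seen)


if __name__ == "__main__":
    # Placeholder hostname with the right shape; it is not a real service.
    fake = "a" * 55 + "d"
    sample = f"mirror at http://{fake}.onion/index.html, also see example.com"
    print(extract_onion_addresses(sample))
```

In a streaming setup, a function like this would presumably run on each incoming document before the hostnames are queued for download, but that wiring is outside the scope of the sketch.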

Findings

01 Identified 80,049 onion sites in 93 days.

02 Characterized 90% of discovered sites.

03 Detected extensive content duplication and phishing networks.

Abstract

The dark web has become notorious for its association with illicit activities, and there is a growing need for systems that automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the early identification of new Tor sites and the daily analysis of their content. The solution is built on an open source big data stack for data serving (Kubernetes, Kafka, Kubeflow, and MinIO). It continuously discovers onion addresses from different sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloads the HTML from Tor, deduplicates the content using MinHash LSH, and categorizes it with BERTopic topic modeling (SBERT embeddings, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the…
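The deduplication idea named in the abstract, MinHash with locality-sensitive hashing (LSH), can be sketched in pure Python. This is an illustrative stand-in, not the authors' implementation: the function names, the word-level shingle size `k=5`, the signature length `num_perm=64`, and the `bands=16` split are all assumptions chosen for readability. Seeded hashes take the place of random permutations, and documents whose signatures agree on every row of at least one band become candidate duplicates.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of word k-grams of a document (the whole text if shorter than k)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """For each seed, keep the minimum hash over all tokens."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Return pairs of document ids that share an identical band of their signatures."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical page texts; "a" and "b" simulate a mirrored site.
    pages = {
        "a": "welcome to the hidden market buy and sell goods with escrow protection today",
        "b": "welcome to the hidden market buy and sell goods with escrow protection today",
        "c": "this forum discusses privacy tools operating systems and secure messaging apps daily",
    }
    sigs = {doc_id: minhash_signature(shingles(text)) for doc_id, text in pages.items()}
    print(lsh_candidate_pairs(sigs))
```

More bands with fewer rows each make the filter more permissive, so lower-similarity pairs become candidates; fewer, wider bands make it stricter. A production pipeline would more likely rely on a maintained library than on hand-rolled hashing.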

Code & Models

Repositories

javier-pg/dark-web-architecture (Official)

Taxonomy

Topics: Advanced Malware Detection Techniques · Cybercrime and Law Enforcement Studies · Spam and Phishing Detection