Cybersecurity Data Extraction from Common Crawl

Ashim Mahara

arXiv:2602.22218·cs.CR·February 27, 2026

Open Access

TL;DR

This paper introduces Alpha-Root, a cybersecurity dataset derived from the Common Crawl web graph using community detection, enabling direct mining of quality domains from minimal seed sources.

Contribution

The paper presents a novel method for extracting cybersecurity data: it runs community detection on the Common Crawl web graph to find relevant domains, rather than iteratively scoring page content.

Findings

01  Successfully mined quality domains from the web graph

02  Achieved efficient extraction starting from only 20 seed domains

03  Demonstrated effectiveness compared to traditional iterative methods

Abstract

Alpha-Root is a cybersecurity-focused dataset collected in a single shot from the Common Crawl web graph using community detection. Unlike iterative content-scoring approaches like DeepSeekMath, we mine quality domains directly from the web graph, starting from just 20 trusted seed domains.
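
The idea of mining domains from the web graph through community detection can be illustrated with a small sketch. This is not the paper's pipeline: the abstract does not name the community detection algorithm, so a deterministic variant of label propagation is swapped in here, and the rule "keep every community that contains a seed" is an assumption. All domain names, function names, and the toy link graph are hypothetical, and only the Python standard library is used.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (domain, domain) link pairs."""
    graph = defaultdict(set)
    for src, dst in edges:
        if src != dst:
            graph[src].add(dst)
            graph[dst].add(src)
    return graph


def label_propagation(graph, max_rounds=20):
    """Each node repeatedly adopts the most common label among its neighbours."""
    labels = {node: node for node in graph}  # every domain starts in its own community
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed visiting order keeps runs repeatable
            counts = Counter(labels[nbr] for nbr in graph[node])
            best = max(counts.values())
            # Assumption: ties go to the lexicographically largest label.
            new_label = max(lbl for lbl, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def domains_in_seed_communities(labels, seeds):
    """Return all domains whose community contains at least one seed domain."""
    seed_labels = {labels[s] for s in seeds if s in labels}
    return sorted(d for d, lbl in labels.items() if lbl in seed_labels)


# Hypothetical toy graph: two densely linked groups joined by a single link.
sec = ["sec-a.example", "sec-b.example", "sec-c.example", "sec-d.example"]
shop = ["shop-w.example", "shop-x.example", "shop-y.example", "shop-z.example"]
edges = list(combinations(sec, 2)) + list(combinations(shop, 2))
edges.append(("sec-d.example", "shop-w.example"))

graph = build_graph(edges)
labels = label_propagation(graph)
selected = domains_in_seed_communities(labels, seeds=["sec-a.example"])
print(selected)
```

With a single seed, the sketch returns the whole densely linked group around that seed and leaves the other group out, which is the sense in which a handful of trusted domains can pull in many related ones without scoring any page content. Practical label propagation usually randomises the visiting order and tie-breaking, and a graph of Common Crawl's size would call for a dedicated graph library rather than in-memory dictionaries.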

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Spam and Phishing Detection · Authorship Attribution and Profiling