Cybersecurity Data Extraction from Common Crawl
Ashim Mahara

TL;DR
This paper introduces Alpha-Root, a cybersecurity dataset derived from the Common Crawl web graph using community detection, enabling direct mining of quality domains from minimal seed sources.
Contribution
It presents a novel method for extracting cybersecurity data by leveraging community detection on the web graph, avoiding iterative content scoring.
Findings
Successfully mined quality domains from the web graph
Achieved efficient extraction starting from only 20 seed domains
Demonstrated effectiveness compared to traditional iterative methods
Abstract
Alpha-Root is a cybersecurity-focused dataset collected in a single shot from the Common Crawl web graph using community detection. Unlike iterative content-scoring approaches like DeepSeekMath, we mine quality domains directly from the web graph, starting from just 20 trusted seed domains.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCybercrime and Law Enforcement Studies · Spam and Phishing Detection · Authorship Attribution and Profiling
