A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence
Paris Koloveas, Thanasis Chantzios, Christos Tryfonopoulos, Spiros Skiadopoulos

TL;DR
This paper introduces a novel, open-source crawler architecture that efficiently harvests cyber-security information from the clear, social, and dark web, utilizing machine learning and language modeling to prioritize relevant data for cyber-threat intelligence.
Contribution
It presents a two-phase crawling architecture combining machine learning and language modeling, specifically designed for comprehensive cyber-threat intelligence gathering across web domains.
Findings
Effective data harvesting demonstrated with crowdsourced evaluation
Two-phase approach improves relevance of collected data
Open-source implementation facilitates adoption and adaptation
Abstract
The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information that, given the appropriate tools and methods, may be identified, crawled and subsequently leveraged into actionable cyber-threat intelligence. In this work, we focus on the information gathering task, and present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web. The proposed architecture adopts a two-phase approach to data harvesting. Initially, a machine learning-based crawler is used to direct the harvesting towards websites of interest, while in the second phase state-of-the-art statistical language modelling techniques are used to represent the harvested information in a latent low-dimensional feature space and rank it based on its…
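The two-phase idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory `TOY_WEB`, the keyword-overlap `relevance` function (standing in for a trained classifier), and the raw term-count vectors with cosine similarity (standing in for a learned latent embedding space) are all assumptions made purely for illustration.

```python
import heapq
import math
from collections import Counter

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("iot security news portal", ["a", "b"]),
    "a": ("iot botnet exploit targets camera firmware vulnerability", ["c"]),
    "b": ("cooking recipes and travel tips", ["d"]),
    "c": ("firmware vulnerability exploit advisory for iot router", []),
    "d": ("holiday photo gallery", []),
}

# Assumed topic vocabulary; a real system would learn relevance from data.
TOPIC_TERMS = {"iot", "exploit", "vulnerability", "firmware", "botnet"}


def relevance(text):
    """Stand-in for a trained classifier: fraction of topic terms present."""
    return len(set(text.split()) & TOPIC_TERMS) / len(TOPIC_TERMS)


def focused_crawl(seed, threshold=0.1, max_pages=10):
    """Phase 1: best-first crawl that only expands links of relevant pages."""
    frontier = [(-1.0, seed)]  # min-heap with negated priority = max-heap
    seen, harvested = {seed}, []
    while frontier and len(harvested) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        if score < threshold:
            continue  # irrelevant page: do not store it or follow its links
        harvested.append((url, text))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Child links inherit the parent's score as crawl priority.
                heapq.heappush(frontier, (-score, link))
    return harvested


def cosine(u, v):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def rank(harvested, query):
    """Phase 2: order harvested pages by similarity to a topic query."""
    q = Counter(query.split())
    scored = [(cosine(Counter(text.split()), q), url) for url, text in harvested]
    return [url for _, url in sorted(scored, reverse=True)]


pages = focused_crawl("seed")
print([url for url, _ in pages])
print(rank(pages, "iot botnet exploit"))
```

In this toy setup the crawler never reaches page `d`, because its only inbound link sits on the irrelevant page `b`. That is the intended effect of directing the harvest towards sites of interest before any ranking takes place.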
