DarkDiff: Explainable web page similarity of TOR onion sites
Pieter Hartel, Eljo Haspels, Mark van Staalduinen, Octavio Texeira

TL;DR
DarkDiff is an explainable method for detecting near-duplicate web pages on the Darkweb. Unlike black-box approaches such as MinHash, it reports why two pages are considered near-duplicates.
Contribution
DarkDiff introduces an explainable near-duplicate detection technique tailored for Darkweb homepages, enhancing interpretability over existing black-box methods.
Findings
Effective detection of Darkweb homepage near-duplicates
Provides reasons for near-duplicate classification
Works well on Darkweb homepages because they resemble the clear web of the past
Abstract
In large-scale data analysis, near-duplicates are often a problem. For example, with two near-duplicate phishing emails, a difference in the salutation (Mr versus Ms) is not essential, but whether it is bank A or B is important. The state-of-the-art in near-duplicate detection is a black box approach (MinHash), so one only knows that emails are near-duplicates, but not why. We present DarkDiff, which can efficiently detect near-duplicates while providing the reason why there is a near-duplicate. We have developed DarkDiff to detect near-duplicates of homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
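The contrast drawn in the abstract between a black-box similarity score and an explained one can be illustrated with a small Python sketch. This is not the DarkDiff algorithm, which this page does not describe. It is a minimal illustration using only the standard library, and every name in it (`shingles`, `minhash_signature`, `estimated_similarity`, `explain_difference`, the sample emails) is a hypothetical choice made for this example. The MinHash part yields only a number, whereas the word-level diff yields the edits that separate the two texts.

```python
import difflib
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (hypothetical helper)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Keep one minimum per seeded hash; the seeds stand in for random permutations."""
    return [
        min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in items
        )
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def explain_difference(text_a, text_b):
    """List the word-level edits that turn text_a into text_b."""
    a, b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, " ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Made-up emails echoing the salutation example from the abstract.
mail_a = "Dear Mr Smith, your account at Bank A has been locked"
mail_b = "Dear Ms Smith, your account at Bank A has been locked"

# Black box: a single score, with no indication of what differs.
score = estimated_similarity(
    minhash_signature(shingles(mail_a)),
    minhash_signature(shingles(mail_b)),
)
print(f"estimated similarity: {score:.2f}")

# Explained: the concrete edits behind the difference.
print(explain_difference(mail_a, mail_b))
```

With the edits listed explicitly, a reader can judge whether a difference is incidental (Mr versus Ms) or essential (bank A versus bank B), which a similarity score alone cannot show.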
Taxonomy
Topics: Topic Modeling · Spam and Phishing Detection · Authorship Attribution and Profiling
