DarkDiff: Explainable web page similarity of TOR onion sites

Pieter Hartel; Eljo Haspels; Mark van Staalduinen; Octavio Texeira

arXiv:2308.12134·cs.CR·August 24, 2023·1 cites

TL;DR

DarkDiff is an explainable method for detecting near-duplicate web pages on the Darkweb: it reports why two pages are near-duplicates, whereas black-box approaches such as MinHash report only that they are.

Contribution

DarkDiff introduces an explainable near-duplicate detection technique tailored for Darkweb homepages, enhancing interpretability over existing black-box methods.

Findings

01. Detects near-duplicates of Darkweb homepages efficiently

02. Provides the reason why two pages are classified as near-duplicates

03. Works well on Darkweb homepages because they resemble the clear web of the past

Abstract

In large-scale data analysis, near-duplicates are often a problem. For example, with two near-duplicate phishing emails, a difference in the salutation (Mr versus Ms) is not essential, but whether it is bank A or B is important. The state-of-the-art in near-duplicate detection is a black box approach (MinHash), so one only knows that emails are near-duplicates, but not why. We present DarkDiff, which can efficiently detect near-duplicates while providing the reason why there is a near-duplicate. We have developed DarkDiff to detect near-duplicates of homepages on the Darkweb. DarkDiff works well on those pages because they resemble the clear web of the past.
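
The contrast the abstract draws between a black-box score and an explanation can be illustrated with a minimal sketch. This is not DarkDiff's algorithm, which the abstract does not describe; it only shows, under stated assumptions, that a MinHash comparison yields a single similarity estimate while a token-level diff shows which words differ. The function names, the shingle size, the number of hash functions, and the example emails are all hypothetical choices made for illustration.

```python
import difflib
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an assumed choice)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """Build a MinHash signature: one minimum hash value per seeded hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def minhash_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def explain_difference(text_a, text_b):
    """Return the non-matching token spans between two texts (a plain diff)."""
    tokens_a, tokens_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    return [
        (tag, tokens_a[i1:i2], tokens_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical phishing emails modeled on the abstract's example.
email_a = "Dear Mr Smith your account at Bank A has been suspended"
email_b = "Dear Ms Smith your account at Bank B has been suspended"

# Black box: a single number, with no indication of what differs.
score = minhash_similarity(
    minhash_signature(shingles(email_a)),
    minhash_signature(shingles(email_b)),
)
print("estimated similarity:", score)

# Explainable: the differing tokens themselves.
for tag, old, new in explain_difference(email_a, email_b):
    print(tag, old, "->", new)
```

The diff surfaces both the salutation change and the bank change; deciding which of the two matters (the bank, in the abstract's example) is a separate step that this sketch does not attempt.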

Taxonomy

Topics: Topic Modeling · Spam and Phishing Detection · Authorship Attribution and Profiling