A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

Navid Yazdanjue; Morteza Rakhshaninejad; Hossein Yazdanjouei; Mohammad Sadegh Khorshidi; Mikko S. Niemela; Fang Chen; Amir H. Gandomi

arXiv:2507.22912 · cs.CL · August 1, 2025

TL;DR

This paper introduces a hierarchical semi-supervised ensemble framework utilizing fine-tuned language models and engineered features to detect and classify illicit marketplace content across diverse online platforms, addressing data scarcity and linguistic variability.

Contribution

It presents a novel multi-source detection framework combining ModernBERT embeddings with ensemble learning and engineered features for improved illicit content classification.

Findings

01

Outperforms baseline models such as BERT and Longformer.

02

Achieves high accuracy (0.96489) and F1-score (0.93467) on multiple datasets.

03

Demonstrates robustness with limited supervision and diverse data sources.

Abstract

Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, fine-tuned on domain-specific data from deep and dark web pages, Telegram channels, subreddits, and Pastebin…
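
The truncated abstract does not spell out the training procedure, so the following is only a minimal sketch of the general idea of hierarchical classification with self-training (pseudo-labeling), not the authors' method. A nearest-centroid classifier stands in for the ensemble, the 2-D toy vectors stand in for ModernBERT embeddings joined with engineered features, and the category names, the `threshold` value, and all function names are assumptions made for illustration.

```python
from math import dist


def centroids(X, y):
    """Compute the mean vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(cents, x):
    """Return (label, confidence); confidence is a distance margin in [0.5, 1]."""
    ranked = sorted(cents, key=lambda c: dist(cents[c], x))
    if len(ranked) == 1:
        return ranked[0], 1.0
    d1, d2 = dist(cents[ranked[0]], x), dist(cents[ranked[1]], x)
    conf = 1.0 - d1 / (d1 + d2) if d1 + d2 > 0 else 0.5
    return ranked[0], conf


def self_train(X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=5):
    """Self-training: absorb confidently pseudo-labeled points, then refit."""
    X, y, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X, y)
        remaining, added = [], False
        for x in pool:
            label, conf = predict(cents, x)
            if conf >= threshold:
                X.append(x)
                y.append(label)
                added = True
            else:
                remaining.append(x)  # too uncertain, keep for a later round
        pool = remaining
        if not added:
            break
    return centroids(X, y)


def classify(stage1, stage2, x):
    """Stage 1 decides illicit vs. benign; stage 2 assigns a category."""
    if predict(stage1, x)[0] != "illicit":
        return "benign"
    return predict(stage2, x)[0]


# Toy feature vectors (hypothetical stand-ins for embeddings + features).
X_lab = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8), (5.0, 9.0), (4.8, 9.2)]
y_stage1 = ["benign", "benign", "illicit", "illicit", "illicit", "illicit"]
y_stage2 = ["drugs", "drugs", "credentials", "credentials"]  # illicit rows only
X_unlab = [(0.2, 0.1), (5.1, 5.1), (4.9, 8.9), (2.5, 3.5)]

stage1 = self_train(X_lab, y_stage1, X_unlab)
# Only unlabeled points that stage 1 routes to "illicit" reach stage 2.
illicit_pool = [x for x in X_unlab if predict(stage1, x)[0] == "illicit"]
stage2 = self_train(X_lab[2:], y_stage2, illicit_pool)

for point in [(0.1, 0.1), (5.0, 5.1), (5.0, 9.0)]:
    print(point, "->", classify(stage1, stage2, point))
```

The confidence threshold controls how aggressively unlabeled samples are absorbed: a higher value adds fewer pseudo-labels but lowers the risk of reinforcing early mistakes, which matters when labeled data is scarce.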

Taxonomy

Topics: Cybercrime and Law Enforcement Studies · Spam and Phishing Detection · Crime, Illicit Activities, and Governance