A Graph-based Stratified Sampling Methodology for the Analysis of (Underground) Forums
Giorgio Di Tizio, Gilberto Atondo Siu, Alice Hutchings, Fabio Massacci

TL;DR
This paper introduces a graph-based stratified sampling methodology for selecting which forum posts to annotate. It improves recall in criminal activity detection while maintaining precision, making better use of costly human annotation when training classifiers for underground forum analysis.
Contribution
It proposes a novel stratified sampling approach based on centrality metrics to enhance classifier performance in underground forum analysis.
Findings
Sampling posts uniformly across degree centrality increases recall by 30% compared with a sample that mirrors the population stratification.
Classifiers trained on similar samples disagree up to 33% on criminal activity detection.
The methodology maintains precision while significantly boosting recall.
Abstract
[Context] Researchers analyze underground forums to study abuse and cybercrime activities. Due to the size of the forums and the domain expertise required to identify criminal discussions, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. [Goal] Human annotation is costly. How can we select samples to annotate that account for the structure of the forum? [Method] We present a methodology to generate stratified samples based on information about the centrality properties of the population and evaluate classifier performance. [Result] We observe that by employing a sample obtained from a uniform distribution of the post degree centrality metric, we maintain the same level of precision but significantly increase the recall (+30%) compared to a sample whose distribution respects the population stratification. We find that…
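The two sampling strategies the abstract compares can be illustrated with a minimal sketch. This is not the authors' implementation: the toy reply graph, the stratum boundaries, the sample size, and all function names are assumptions made for illustration, and the paper may build its graph and strata differently. The sketch bins posts into strata by degree, then draws either a sample that mirrors the population (proportional) or one with the same number of posts from every stratum (uniform).

```python
import random
from collections import Counter, defaultdict

def degree_of_posts(edges):
    """Count how many links touch each post (its degree in the toy graph)."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return degree

def stratify(degree, boundaries):
    """Group post ids into strata; boundaries are inclusive upper bounds,
    and the last stratum is open-ended."""
    strata = defaultdict(list)
    for post, d in sorted(degree.items()):
        strata[sum(d > b for b in boundaries)].append(post)
    return strata

def proportional_sample(strata, n, rng):
    """Draw from each stratum in proportion to its share of the population."""
    total = sum(len(posts) for posts in strata.values())
    sample = []
    for idx in sorted(strata):
        k = round(n * len(strata[idx]) / total)
        sample += rng.sample(strata[idx], min(k, len(strata[idx])))
    return sample

def uniform_sample(strata, n, rng):
    """Draw the same number of posts from every stratum (capped by its size)."""
    per_stratum = n // len(strata)
    sample = []
    for idx in sorted(strata):
        sample += rng.sample(strata[idx], min(per_stratum, len(strata[idx])))
    return sample

def stratum_counts(sample, strata):
    """Report how many sampled posts fall in each stratum."""
    lookup = {post: idx for idx, posts in strata.items() for post in posts}
    counts = Counter(lookup[post] for post in sample)
    return [counts[idx] for idx in sorted(strata)]

# Hypothetical toy forum: two hub posts, a short chain, many low-degree posts.
edges = ([(0, i) for i in range(2, 60)]
         + [(1, i) for i in range(60, 100)]
         + [(i, i + 1) for i in range(2, 20)])

rng = random.Random(0)                    # fixed seed for repeatability
degree = degree_of_posts(edges)
strata = stratify(degree, boundaries=[1, 3])   # assumed cut points
prop = proportional_sample(strata, 6, rng)
unif = uniform_sample(strata, 6, rng)

print("stratum sizes:      ", [len(strata[i]) for i in sorted(strata)])
print("proportional counts:", stratum_counts(prop, strata))
print("uniform counts:     ", stratum_counts(unif, strata))
```

In this toy setup the proportional sample contains no high-degree posts at all, because they are a tiny share of the population, whereas the uniform sample always includes them. That is the kind of difference in annotated data the abstract's comparison is about.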
