Detection of Criminal Texts for the Polish State Border Guard
Artur Nowakowski, Krzysztof Jassem

TL;DR
This paper presents a method for detecting criminal texts on the Polish Internet using fine-tuned transformer models, introducing a new dataset and benchmark for this task.
Contribution
It introduces a novel dataset and benchmark for criminal text detection in Polish, and demonstrates the effectiveness of fine-tuned transformer models for this task.
Findings
A model fine-tuned from a pre-trained Polish transformer language model performed best among the setups tested.
The collected dataset enables benchmarking of criminal text detection methods.
The experiments specifically target the classification of unbalanced and noisy data.
Abstract
This paper describes research on the detection of Polish criminal texts appearing on the Internet. We carried out experiments to find the best available setup for the efficient classification of unbalanced and noisy data. The best performance was achieved when our model was fine-tuned on a pre-trained Polish-based transformer language model. For the detection task, a large corpus of annotated Internet snippets was collected as training data. We share this dataset and create a new task for the detection of criminal texts using the Gonito platform as the benchmark.
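The abstract stresses that the data are unbalanced. The sketch below shows, on made-up labels, why that matters for evaluation and one common way to compensate during training. It is not taken from the paper: the binary label scheme (1 = criminal snippet, 0 = other), the 90/10 split, the inverse-frequency weighting, and the F1 metric are all assumptions chosen for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def f1_for_class(gold, predicted, positive):
    """F1 score for one class, computed from raw label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (assumption): 1 = criminal snippet, 0 = other.
gold = [0] * 90 + [1] * 10

# The rare class receives a larger weight, e.g. for a weighted loss.
weights = balanced_class_weights(gold)

# A trivial baseline that always predicts the majority class.
always_zero = [0] * len(gold)
accuracy = sum(g == p for g, p in zip(gold, always_zero)) / len(gold)
f1 = f1_for_class(gold, always_zero, positive=1)

print(weights)
print(accuracy, f1)
```

On this toy split the majority-class baseline looks strong on accuracy while never detecting a single positive snippet, which is why a per-class metric is usually reported alongside accuracy for unbalanced tasks.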
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
