Detection of Criminal Texts for the Polish State Border Guard

Artur Nowakowski; Krzysztof Jassem

arXiv:2108.10580 · cs.CL · August 25, 2021 · 1 citation

Open Access

TL;DR

This paper presents a method for detecting criminal texts on the Polish Internet using fine-tuned transformer models, introducing a new dataset and benchmark for this task.

Contribution

It introduces a novel dataset and benchmark for criminal text detection in Polish, and demonstrates the effectiveness of fine-tuned transformer models for this task.

Findings

01 Fine-tuning a pre-trained Polish transformer language model gave the best classification performance among the setups tested.

02 The collected dataset enables benchmarking of criminal text detection methods.

03 The experiments target efficient classification of unbalanced and noisy data.

Abstract

This paper describes research on the detection of Polish criminal texts appearing on the Internet. We carried out experiments to find the best available setup for the efficient classification of unbalanced and noisy data. The best performance was achieved when our model was fine-tuned on a pre-trained Polish-based transformer language model. For the detection task, a large corpus of annotated Internet snippets was collected as training data. We share this dataset and create a new task for the detection of criminal texts using the Gonito platform as the benchmark.
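
The abstract stresses that the data are unbalanced. The sketch below is an illustration of why that matters when scoring a detector, not the authors' evaluation code: the labels are hypothetical toy data (1 = criminal text, 0 = other), the function names are made up for this example, and the metric actually used on the benchmark is not stated here.

```python
def accuracy(gold, predicted):
    """Fraction of examples where the predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f1(gold, predicted, positive=1):
    """Precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy set: 5 criminal snippets among 100 (assumed ratio).
gold = [1] * 5 + [0] * 95

# A degenerate classifier that never flags anything.
always_negative = [0] * 100

# A classifier that finds 4 of the 5 criminal snippets and raises 4 false alarms.
detector = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

print(accuracy(gold, always_negative), precision_recall_f1(gold, always_negative))
print(accuracy(gold, detector), precision_recall_f1(gold, detector))
```

On this toy set the classifier that never flags anything has the same accuracy as the one that finds most of the criminal snippets, while its precision, recall and F1 on the criminal class are all zero. This is why per-class metrics are usually reported alongside accuracy when one class is rare.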

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling