BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service
Anna Kołos, Inez Okulska, Kinga Głąbińska, Agnieszka Karlińska, Emilia Wiśnios, Paweł Ellerik, Andrzej Prałat

TL;DR
This paper introduces BAN-PL, a new open Polish dataset of social media posts from Wykop.pl, aimed at improving NLP-based moderation tools by providing real-world, annotated harmful and neutral content with detailed analysis and anonymization procedures.
Contribution
It presents the first large-scale, publicly available Polish dataset of social media content for offensive content detection, including a thorough analysis of linguistic features and moderation biases.
Findings
The dataset contains 691,662 posts and comments, balanced between harmful and neutral categories.
An anonymized subset of 24,000 pieces (12,000 per class) is publicly available for research.
The paper discusses biases and content characteristics relevant to moderation tasks.
Abstract
Since the Internet is flooded with hate, it is one of the main tasks for NLP experts to master automated online content moderation. However, advancements in this field require improved access to publicly available, accurate, and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting a new open dataset of offensive social media content for the Polish language. The dataset comprises content from Wykop.pl, a popular online service often referred to as the "Polish Reddit", reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: "harmful" and "neutral" ("non-harmful"). The anonymized subset of the BAN-PL dataset consisting of 24,000 pieces (12,000 for each class), along with preprocessing scripts have…
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
