BAN-PL: a Novel Polish Dataset of Banned Harmful and Offensive Content from Wykop.pl web service

Anna Kołos; Inez Okulska; Kinga Głąbińska; Agnieszka Karlińska; Emilia Wiśnios; Paweł Ellerik; Andrzej Prałat

arXiv:2308.10592 · cs.CL · March 27, 2024 · 1 citation

TL;DR

This paper introduces BAN-PL, a new open Polish dataset of social media posts from Wykop.pl, aimed at improving NLP-based moderation tools by providing real-world, annotated harmful and neutral content with detailed analysis and anonymization procedures.

Contribution

It presents the first large-scale, publicly available Polish dataset of social media content for offensive content detection, including a thorough analysis of linguistic features and moderation biases.

Findings

01

The dataset contains 691,662 posts and comments, balanced between harmful and neutral categories.

02

An anonymized subset of 24,000 pieces is publicly available for research.

03

The paper discusses biases and content characteristics relevant to moderation tasks.
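The class balance described in the findings (an equal number of "harmful" and "neutral" items) can be illustrated with a minimal sketch. It assumes records are held as `(text, label)` pairs; the function name `balanced_subset`, the toy data, and the sampling strategy are hypothetical and are not taken from the paper or its repository.

```python
import random
from collections import Counter


def balanced_subset(records, per_class, seed=0):
    """Draw `per_class` records for every label, without replacement.

    `records` is an iterable of (text, label) pairs. This layout is an
    assumption made for illustration, not the BAN-PL file format.
    """
    rng = random.Random(seed)

    # Group the records by label.
    by_label = {}
    for text, label in records:
        by_label.setdefault(label, []).append((text, label))

    subset = []
    for label in sorted(by_label):
        items = by_label[label]
        if len(items) < per_class:
            raise ValueError(f"not enough records for label {label!r}")
        subset.extend(rng.sample(items, per_class))

    # Shuffle so the classes are interleaved in the output.
    rng.shuffle(subset)
    return subset


# Toy stand-in data: 50 "harmful" and 50 "neutral" placeholder posts.
toy = [(f"post {i}", "harmful" if i % 2 else "neutral") for i in range(100)]
subset = balanced_subset(toy, per_class=12)
counts = Counter(label for _, label in subset)
print(counts)
```

With the real data, `per_class=12000` would correspond to the 12,000 items per class reported for the public subset, provided each class holds at least that many records.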

Abstract

Since the Internet is flooded with hate, it is one of the main tasks for NLP experts to master automated online content moderation. However, advancements in this field require improved access to publicly available, accurate, and non-synthetic datasets of social media content. For the Polish language, such resources are very limited. In this paper, we address this gap by presenting a new open dataset of offensive social media content for the Polish language. The dataset comprises content from Wykop.pl, a popular online service often referred to as the "Polish Reddit", that was reported by users and banned in the internal moderation process. It contains a total of 691,662 posts and comments, evenly divided into two categories: "harmful" and "neutral" ("non-harmful"). The anonymized subset of the BAN-PL dataset consisting of 24,000 pieces (12,000 for each class), along with preprocessing scripts have…

Code & Models

Repositories

ziliat-nask/ban-pl
Official

Models

NASK-PIB/BANonymizer-PL (Hugging Face model, 129 downloads, 2 likes)

Taxonomy

Topics: Hate Speech and Cyberbullying Detection